Liveblog: Managing Your Heroes: The People Aspect of Monitoring

At the Nagios World Conference North America, Alex Solomon of PagerDuty talked about Managing Your Heroes: The People Aspect of Monitoring.

First he goes over some acronyms:
SLA – service level agreement
MTTR – mean time to resolution – avg time it takes to fix a problem
– MTTR can also mean "mean time to response", which is one part of the time to resolution – avg time it takes to respond.
MTBF – mean time between failures
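
To make the acronyms concrete, here is a minimal sketch of how they could be computed from an incident log. The incident data and function names are made up for illustration, and MTBF is measured here as the average gap between one incident being resolved and the next being detected, which is only one common convention.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, acknowledged, resolved) timestamps.
incidents = [
    (datetime(2012, 9, 1, 2, 0), datetime(2012, 9, 1, 2, 4), datetime(2012, 9, 1, 2, 30)),
    (datetime(2012, 9, 11, 2, 30), datetime(2012, 9, 11, 2, 36), datetime(2012, 9, 11, 3, 30)),
    (datetime(2012, 9, 21, 3, 30), datetime(2012, 9, 21, 3, 35), datetime(2012, 9, 21, 4, 0)),
]


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_response(log):
    """Average time from detection to someone acknowledging the alert."""
    return mean([acked - detected for detected, acked, _ in log])


def mean_time_to_resolution(log):
    """Average time from detection to the problem being fixed."""
    return mean([resolved - detected for detected, _, resolved in log])


def mean_time_between_failures(log):
    """Average uptime between one resolution and the next detection."""
    ordered = sorted(log)
    gaps = [nxt[0] - prev[2] for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


print("response:  ", mean_time_to_response(incidents))
print("resolution:", mean_time_to_resolution(incidents))
print("MTBF:      ", mean_time_between_failures(incidents))
```

Response time is always contained in resolution time, which is why improving how fast people respond directly improves MTTR.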

How can we prevent outages?
– look for single points of failure (SPOFs) – engineer them to be redundant, if you can’t live with the downtime.
– a complex, monolithic system means that a failure in one part can take down another part. e.g. if your reporting tool and sales system run against the same machines, a heavily loaded reporting tool will affect customers who want to buy your product.
– systems that change a lot are prone to more outages
– Outages WILL happen

Failure lifecycle:
– monitoring -> alert -> investigate -> fix -> root-cause analysis, and from here it could go back to any of the other stages. The line between fix and investigate is blurry, because you might try something in the course of investigation and it might actually fix it.

Why monitor everything?
Metrics, metrics, metrics. “If it’s easy, just monitor it. You can’t have too many metrics.”

Tools – for internal, behind the firewall – Nagios, Splunk. External – New Relic, Pingdom. Metrics – Graphite, Datadog.

Severities – based on business impact
sev1 – large scale business loss (critical)
sev2 – small to medium business loss (critical)
sev3 – no immediate business loss, customer may be impacted
sev4 – no immediate business loss, no customers impacted

Each severity level should have its own standard operating procedure (SOP)/SLA:
sev1 – major outage, all hands on deck. Notify the team via phone/sms, response time 5 min
sev2 – critical issue – notify the oncall person/team via phone/sms, response time 15 min
sev3/4 – non-critical issue – notify the on-call person via e-mail, response time next business day.
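
The severity-to-SOP mapping above could be kept as data, so that the alerting code looks up who to notify and how fast. This is only a sketch: the `SeverityPolicy` class, the `POLICIES` table and `policy_for` are hypothetical names, and "next business day" is represented as `None` rather than a fixed duration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    notify: str                          # who gets notified
    channels: Tuple[str, ...]            # how they get notified
    response_time: Optional[timedelta]   # None means "next business day"


# Values taken from the talk notes; the structure itself is an assumption.
POLICIES = {
    1: SeverityPolicy("major outage", "whole team", ("phone", "sms"), timedelta(minutes=5)),
    2: SeverityPolicy("critical issue", "on-call person/team", ("phone", "sms"), timedelta(minutes=15)),
    3: SeverityPolicy("non-critical issue", "on-call person", ("email",), None),
    4: SeverityPolicy("non-critical issue", "on-call person", ("email",), None),
}


def policy_for(severity: int) -> SeverityPolicy:
    """Look up the SOP for a severity level, rejecting unknown levels."""
    try:
        return POLICIES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity}") from None


print(policy_for(2))
```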

Severities can be downgraded/upgraded.

Alert *before* systems fail completely.
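
One way to alert before a complete failure is to use a warning threshold below the critical one, the way Nagios plugins report state through exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The sketch below assumes a disk-usage check; the function name and the 80%/90% thresholds are made up for illustration.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_disk_usage(percent_used, warn=80.0, crit=90.0):
    """Return (status, message), warning well before the disk is full."""
    if percent_used >= crit:
        return CRITICAL, f"CRITICAL - disk {percent_used:.0f}% used"
    if percent_used >= warn:
        return WARNING, f"WARNING - disk {percent_used:.0f}% used"
    return OK, f"OK - disk {percent_used:.0f}% used"


status, message = check_disk_usage(85.0)
print(status, message)
```

A real plugin would print the message and call `sys.exit(status)` so that Nagios picks up the state.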

Oncall best practices – have a cellphone for phone calls and SMS. You might want to get a pager, but the paging system isn’t necessarily reliable either. A smartphone is better, because you can then handle the problem from the phone. Have 4G/3G internet access as well – like a 4G hotspot, a USB modem, or tethering.

Set up your system so it pages multiple times until you respond. Escalate to different phones as needed. Get a vibrating bluetooth bracelet if you sleep with someone else and they don’t want to be disturbed.
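
The "page repeatedly, then escalate" idea can be sketched as a loop over an escalation chain. Everything here is hypothetical: `send` and `is_acknowledged` stand in for whatever your paging system provides, and the retry count and wait time are arbitrary defaults.

```python
import time


def page_until_acknowledged(contacts, send, is_acknowledged,
                            retries_per_contact=3, wait_seconds=300,
                            sleep=time.sleep):
    """Page each contact in order, several times each, until one acknowledges.

    Returns the contact who acknowledged, or None if nobody did.
    """
    for contact in contacts:
        for attempt in range(1, retries_per_contact + 1):
            send(contact, attempt)
            sleep(wait_seconds)  # give the person time to respond
            if is_acknowledged(contact):
                return contact
    return None


if __name__ == "__main__":
    # Toy run: the primary never answers, the secondary answers right away.
    who = page_until_acknowledged(
        ["primary", "secondary", "manager"],
        send=lambda contact, attempt: print(f"paging {contact} (attempt {attempt})"),
        is_acknowledged=lambda contact: contact == "secondary",
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print("acknowledged by:", who)
```

Passing `sleep` in as a parameter keeps the loop easy to test without actually waiting five minutes between pages.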

Don’t send calls to the whole team if one person can handle it. You wake everyone up, and the issue could be ignored by everyone or worked on in duplicate by everyone.

Follow-the-sun paging, oncall schedules.
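
Follow-the-sun paging means the page goes to whichever team is in its working day. A minimal sketch, assuming three hypothetical teams that each cover an 8-hour window in UTC (team names and hours are invented for illustration):

```python
from datetime import datetime, timezone

# (start hour, end hour, team) in UTC; hypothetical shift table.
SHIFTS = [
    (0, 8, "apac-oncall"),
    (8, 16, "emea-oncall"),
    (16, 24, "americas-oncall"),
]


def on_call_team(now):
    """Return the team on call at `now`, which must be a timezone-aware datetime."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise RuntimeError(f"shift table does not cover hour {hour}")


print(on_call_team(datetime.now(timezone.utc)))
```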

Measure on-call performance, measure MTTR, % of issues that were escalated, set up policies to encourage good performance. Managers should be in the on-call chain, and you can pay people extra to do on-call. Google pays per on-call shift, so people actually volunteer to be on-call.
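
As one example of measuring on-call performance, the percentage of escalated issues can be tracked per responder. The record format and names below are hypothetical; a real system would pull these from the paging tool's incident history.

```python
from collections import defaultdict


def escalation_rate_by_responder(records):
    """Given (responder, was_escalated) pairs, return % escalated per responder."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for responder, was_escalated in records:
        totals[responder] += 1
        if was_escalated:
            escalated[responder] += 1
    return {r: 100.0 * escalated[r] / totals[r] for r in totals}


# Made-up sample data.
records = [
    ("alice", False), ("alice", True), ("bob", False),
    ("alice", False), ("alice", False),
]
print(escalation_rate_by_responder(records))
```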

NOCs reduce the MTTR drastically. They are expensive (staffed 24×7 with multiple people), but you can train your NOC staff to fix a good percentage of the issues. As you scale, you might want a hybrid on-call approach – NOC handles some, teams directly handle others.

Automate fixes and/or add more fault tolerance.

You need the right tools (monitoring tools were mentioned before). Soft tools:
voice – conference bridge / skype / google hangout
chat – hipchat, campfire [we use IRC at mozilla]

Best practice: have an incident commander, who provides leadership and is in charge of the situation. This prevents analysis paralysis, makes clear who should do what, etc.