Notes from Nagios World Conference 2012

Nagios World Conference 2012 was held from Sep 23 to Sep 28 in St. Paul, MN. I represented Mozilla IT/SRE along with Sheeri Cabral, who spoke about MySQL plugins. I wanted to share some observations and my best takeaways from the conference. I attended 11 talks in all and spoke to various speakers and other attendees about our setup and some tricky monitoring problems we’ve had to face.

My list of items to address at the conference was:

  1. Discuss our Nagios setup – pros and cons. Do we have a sound setup? Do we use best practices and abide by guidelines? How do we compare to other setups of similar magnitude?
  2. Are there good ideas to reduce the number of actionable as well as unactionable alerts?
  3. How are we scaling up? Can we scale better? Are there any tools that will help us deal with problems of scale?
  4. Are we harnessing Nagios to the best of its capabilities? Are there more features that Nagios can provide that we lack?
  5. General monitoring best practices and ideas.

Here are my notes for each of the above:

  1. (Pros/Cons) Our Nagios setup is well thought out and aligns closely with the best practices recommended to the audience. The suggestions included: use hostgroups as much as possible instead of configuring hosts individually; separate out instances for better scalability (which we do on a per-DC basis); set up the Nagios servers on highly available clusters (most of our instances are VMs, which inherently provide availability and reliability); use chef/puppet/…; and “play” with checkcommands instead of adding more plugins for similar functionality (see the number of check_http derivatives and our use of “-p <port number>” in http service checks). Some specific performance tuning tips (liveblog) were also given, and I will work on implementing them.
  2. (Reduce noise) This depends heavily on infrastructure stability. My assessment is that Mozilla IT is still scaling up and it is difficult to “stabilise” in the traditional sense. I think it will take us a year or two to get to a state where most of our alerts mean real problems, not just potential problems or ones that could cause problems if left unacknowledged over a period of time. This also requires an overhaul of service and operational SLAs. tl;dr we’ll get there in time. We can surely work on reducing unactionable alerts in the meantime and that should be a significant goal.
  3. (Scale) We are scaling up pretty well. Our idea of one instance per DC works very well. We are presently missing any form of aggregation (think MNTOS, but better. Thruk maybe?). We are currently monitoring fewer than 1500 hosts per instance and the (mostly) stock configuration has held up pretty well. As such, I don’t see much benefit in investing a lot of time and effort into scaling (see Mod Gearman).
  4. (Utilizing Nagios) We are using Nagios very very well. Between two-way communication using IRC bots and the assortment of alerting mechanisms we have, we are making pretty good use of what Nagios provides and adding more to it. I believe we’re still far from hitting the limits of what Nagios can do and even then we can leverage the tools that work with Nagios to extend it further (addons, plugins like check_mk and multisite). The beauty of Nagios lies in its simplicity. It does not have every. single. thing. a monitoring solution should have. Rather, it follows the UNIX philosophy of “do it less, do it well”. ’nuff said!
  5. (Best practices) We are missing some good-to-have features such as performance graphing of Nagios itself as well as everything it monitors, analytics and trending (one word but super important, can make a world of difference. /me waves to Corey Shields who had proposed this earlier this year). This is going to be an ongoing goal and something we should refine all the time. It’s also a little hard to work towards, since defining the specific items is itself tricky.
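
To make the hostgroup and checkcommand advice from item 1 a bit more concrete, here is a minimal sketch in Python that prints Nagios object definitions: one parameterized command that passes a port to check_http via “-p”, one hostgroup, and one service per port attached to the hostgroup rather than to individual hosts. Every name in it (web-servers, check_http_port, generic-service, web1, web2) is made up for illustration and is not taken from our actual configuration, which is generated by puppet.

```python
# Sketch only. All object names below ("web-servers", "check_http_port",
# "generic-service", "web1", "web2") are hypothetical placeholders.


def render_object(kind, fields):
    """Render one Nagios object definition from (directive, value) pairs."""
    width = max(len(key) for key, _ in fields)
    lines = [f"define {kind} {{"]
    for key, value in fields:
        lines.append(f"    {key.ljust(width)}  {value}")
    lines.append("}")
    return "\n".join(lines)


def port_check_command():
    """One command taking the port as $ARG1$, instead of one plugin per port."""
    return render_object("command", [
        ("command_name", "check_http_port"),
        ("command_line", "$USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$"),
    ])


def hostgroup(name, alias, members):
    """Group hosts so that checks are defined once for the whole group."""
    return render_object("hostgroup", [
        ("hostgroup_name", name),
        ("alias", alias),
        ("members", ",".join(members)),
    ])


def http_service(hostgroup_name, port):
    """Attach the check to the hostgroup; new members pick it up automatically."""
    return render_object("service", [
        ("use", "generic-service"),  # assumes a service template with this name exists
        ("hostgroup_name", hostgroup_name),
        ("service_description", f"HTTP on port {port}"),
        ("check_command", f"check_http_port!{port}"),
    ])


def build_config():
    """Assemble the command, the hostgroup and one service per port."""
    parts = [
        port_check_command(),
        hostgroup("web-servers", "Web servers", ["web1", "web2"]),
    ]
    parts += [http_service("web-servers", port) for port in (80, 8080)]
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(build_config())
```

Adding a host to the hostgroup, or a port to the tuple in build_config, is then a one-line change, which is the point of preferring hostgroups and command arguments over per-host services and near-duplicate plugins.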

We were also introduced to Nagios Incident Manager (liveblog), an upcoming product which integrates with Nagios and is useful for handling high severity incidents. I would like to see this set up, since Bugzilla might not always be the best tool for such cases.

Lastly, I realized more than once that we have a kickass setup at Mozilla. Particularly, our puppet module is awesome and utilizes many sound principles. I strongly believe there is a *lot* of value in open-sourcing our manifests and templates. This is surely going to help a whole lot of people and earn us tons of goodwill amongst IT.

The guys from Nagios are great too – a good pack of developers and it’s very interesting to hear about monitoring crazy stuff, like Microsoft Word macros(!). Overall, I’d recommend being part of the Nagios Conferences if you’re serious about monitoring your stuff (and you should be)!

1 response

  1. Steve Fink wrote:

    My insanely biased opinion based on too little experience to be taken seriously: definitely look into check_mk and multisite, but “…we can leverage the tools that work with Nagios to extend it further…” doesn’t begin to describe what it’d be like if you took check_mk seriously.

    Nagios is crap. It’s the best crap around, or at least was for a very long time, but it’s a ridiculous steaming pile of unmanageable complexity. Its configuration makes sendmail look sane. It is constructed from a loose collection of features that are mostly minor bandaids to deeper underlying problems. You can spend days/weeks/months tweaking and tuning and feeling like you’re making a lot of progress, but you don’t realize you’re really painting an elephant with a toothbrush. While it’s rolling in mud.

    check_mk and multisite honestly kind of suck too, but they are based on a sane architecture and have the potential to be built into something really good. Time spent on them is way more productive — you can actually set things up to fix problems for real instead of constantly adding bandaids for individual symptoms. It takes some getting used to now that you’re painting the elephant with a spray gun instead of a toothbrush — it’s sometimes harder to get the hot pink paint just on the toenails without splattering the surrounding area a bit — but once you get the hang of it even the fine-tuning ends up easier than with regular nagios.

    It’s nice that you can start out by just adding check_mk alongside regular nagios checks, but eventually you’ll want to put it in charge and do everything its way.

    The one design decision that I disagree with is that the check_mk client agent can’t accept input from the server. It’s easy enough to hack together one that can — I did it — but then you have to worry about backwards compatibility with plain agents. (I couldn’t open up a different port to reach them, so I had to make it work with either the old or new agent, and it was easy to cause a hang with a new agent waiting for input that never comes.) Example use: I was on a warpath to reduce the noise in the signal, and in particular I heavily relied on check dependencies to minimize alert storms. For testing network reachability, it’s nice to be able to tell the agent what IPs to try to reach, given centralized knowledge of which monitored hosts were part of what network. I also had some expensive checks that I only wanted to run on certain host types, but didn’t want to lose the huge management gains of check_mk’s passive setup.

    Sorry, that’s all vague and unnecessarily opinionated. It’s been a while, too. But check_mk allowed me to get to the point where I was worrying about setting up different CPU utilization thresholds for different types of servers (as in, if I got a high CPU alert, there was a very very good chance it was a real problem) and embedding initial diagnostic procedures into the monitoring infrastructure (instead of just knowing that host X was unreachable, I would know exactly where in the convoluted network path to that host the connectivity was being lost). Oh, and it gave lots and lots of perf stats to graph and trend and show to people in suits. Doing all that with bare Nagios would have been a nightmare. (Or rather, it *was* a nightmare, and didn’t work, which is why I switched to check_mk across the board.)