Nagios World Conference 2012 was held between September 23 and 28 in St. Paul, MN. I represented Mozilla IT/SRE along with Sheeri Cabral, who spoke about MySQL plugins. I wanted to share some observations and my best takeaways from the conference. I attended 11 talks in all and spoke to various speakers and other attendees about our setup and some tricky monitoring problems we’ve had to face.
My list of items to address at the conference was:
- Discuss our Nagios setup – pros and cons. Do we have a sound setup? Do we follow best practices and guidelines? How do we compare to other setups of similar magnitude?
- Are there good ideas to reduce the number of actionable as well as unactionable alerts?
- How are we scaling up? Can we scale better? Are there any tools that will help us deal with problems of scale?
- Are we harnessing Nagios to the best of its capabilities? Are there more features that Nagios can provide that we lack?
- General monitoring best practices and ideas.
Here are my notes for each of the above:
- (Pros/Cons) Our Nagios setup is well thought out and aligns closely with the best practices and recommendations suggested to the audience. Among the suggestions: use hostgroups as much as possible instead of configuring hosts individually, separate out instances for better scalability (which we do on a per-DC basis), set up the Nagios servers on highly available clusters (most of our instances are VMs, which inherently provide availability and reliability), use chef/puppet/…, and “play” with checkcommands instead of adding more plugins for similar functionality (see the number of check_http derivatives and our use of “-p <port number>” in http service checks). There were also some specific performance tuning tips (liveblog) that I will work on implementing.
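To illustrate the hostgroup-plus-parameterized-checkcommand pattern mentioned above, here is a minimal sketch of Nagios object configuration. The hostgroup, host, and command names are hypothetical, not taken from our actual manifests; the point is that one generic check_http command with a port argument can serve a whole hostgroup, instead of cloning a plugin per port:

```cfg
# Hypothetical example: one hostgroup-wide service definition using the
# stock check_http plugin with a port argument ($ARG1$), rather than a
# separate plugin copy or checkcommand per port.

define hostgroup {
    hostgroup_name  webheads
    alias           Web application servers
    members         web1, web2, web3
}

define command {
    command_name    check_http_port
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -p $ARG1$
}

define service {
    use                     generic-service
    hostgroup_name          webheads
    service_description     HTTP on 8080
    check_command           check_http_port!8080
}
```

Adding a host to the group (or changing the port argument) updates monitoring everywhere, which is what makes hostgroup-driven config scale better than per-host definitions.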
- (Reduce noise) This vastly depends on infrastructure stability. My assessment is that Mozilla IT is in a state of upward scale and it is difficult to “stabilize” in the traditional sense. I think it will take us a year or two to get to a state where most of our alerts mean real problems, not just potential problems or ones that could cause problems if left unacknowledged over a period of time. This also requires an overhaul of service and operational SLAs. tl;dr we’ll get there in time. We can surely work on reducing unactionable alerts in the meantime, and that should be a significant goal.
- (Scale) We are scaling up pretty well. Our idea of an instance per DC works very well. We are presently missing some form of aggregation across instances (think MNTOS, but better; Thruk, maybe?). We currently monitor fewer than 1500 hosts per instance, and the (mostly) stock configuration has held up pretty well. As such, I don’t see a lot of benefit in investing much time and effort into scaling (see Mod Gearman).
- (Utilizing Nagios) We are using Nagios very well. Between two-way communication using IRC bots and the assortment of alerting mechanisms we have, we are making good use of what Nagios provides and adding more to it. I believe we’re still far from hitting the limits of what Nagios can do, and even then we can leverage the tools that work with Nagios to extend it further (addons, plugins like check_mk and multisite). The beauty of Nagios lies in its simplicity. It does not have every. single. thing. a monitoring solution should have. Rather, it follows the UNIX philosophy of “do one thing, do it well”. ’nuff said!
- (Best practices) We are missing some good-to-have features such as performance graphing of Nagios itself as well as of everything it monitors, analytics, and trending (one word but super important; it can make a world of difference. /me waves to Corey Shields, who had proposed this earlier this year). This is going to be an ongoing goal and something we should refine all the time. It’s also a little hard to work towards, since defining specific items is itself tricky.
We were also introduced to Nagios Incident Manager (liveblog), an upcoming product that integrates with Nagios and is useful for handling high-severity incidents. I would like to see this set up, since Bugzilla might not always be the best tool for such cases.
Lastly, I realized more than once that we have a kickass setup at Mozilla. Particularly, our puppet module is awesome and utilizes many sound principles. I strongly believe there is a *lot* of value in open-sourcing our manifests and templates. This is surely going to help a whole lot of people and earn us tons of goodwill amongst IT.
The folks from Nagios are great too – a sharp bunch of developers – and it’s very interesting to hear about monitoring crazy stuff, like Microsoft Word macros(!). Overall, I’d recommend attending the Nagios Conferences if you’re serious about monitoring your stuff (and you should be)!