Liveblog: Nagios and Another Layer of Indirection

John Sellens presents Nagios and Another Layer of Indirection at the Nagios World Conference. PDF slides are here.

“All problems in computer science can be solved by another level of indirection” – David Wheeler

Nagios Constitution: Separation of Core and State

There are separate components and interfaces, and they’re well-defined. This separation allows us to subvert how they’re supposed to be used and do whatever we want.

Nagios is well-documented; that’s one of its major strengths!

Where is there indirection in Nagios?

Favorite plugin is the negate plugin – in the official nagios plugins.
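
As a rough illustration of the idea behind negate (by default it swaps OK and CRITICAL and passes the other states through), here is a minimal Python sketch. The function name negate_status is made up for this example, and the real plugin has more options than this.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def negate_status(status):
    """Swap OK and CRITICAL; leave WARNING and UNKNOWN untouched."""
    swapped = {OK: CRITICAL, CRITICAL: OK}
    return swapped.get(status, status)
```

So a check that is "good" when a port is closed can reuse an ordinary port check and simply flip the result.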

Remote checking adds another layer – between a local plugin and the nagios server – check_by_ssh, check_nrpe, check_snmp

Another layer of indirection – graphing. Apan was the original nagios grapher, but now there’s performance data and plugins and event brokers that will get the data.

How can we implement indirection?
“Unix Philosophy: Write programs that do one thing and do it well” – Doug McIlroy.

Add a plugin timeout with timeout (a Unix program). Write a wrapper around an existing plugin. You can do multi-stage checks, e.g. is at least one interface up? Is at least one web server up? You can use expect for interactions, or use webform posting tools.
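
A minimal sketch of both ideas in Python, not taken from the slides: run_plugin runs an existing plugin with a timeout, and at_least_one_ok is a multi-stage check that is happy if any one of several checks is OK. Both function names are hypothetical, and returning UNKNOWN on a timeout is an assumption made for this example.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_plugin(command, timeout_seconds=10):
    """Run one plugin command; return (status, first line of its output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # Assumption for this sketch: a timeout is reported as UNKNOWN.
        return UNKNOWN, "plugin timed out after %s seconds" % timeout_seconds
    except OSError as error:
        return UNKNOWN, "could not run plugin: %s" % error
    # Anything outside the known codes (including death by signal) is UNKNOWN.
    status = result.returncode if result.returncode in (OK, WARNING, CRITICAL) else UNKNOWN
    lines = result.stdout.strip().splitlines()
    return status, (lines[0] if lines else "")


def at_least_one_ok(commands, timeout_seconds=10):
    """Multi-stage check: OK if any of the wrapped checks is OK."""
    results = [run_plugin(command, timeout_seconds) for command in commands]
    ok_count = sum(1 for status, _ in results if status == OK)
    if ok_count > 0:
        return OK, "%d of %d checks OK" % (ok_count, len(results))
    return CRITICAL, "0 of %d checks OK" % len(results)
```

Each entry in commands is an argument list, for example one existing web check per server.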

You can make a “pervasive wrapper”, i.e. one that changes *everything*: for example, change the value of $USER1$ to be /usr/local/mywrapper /usr/local/libexec/nagios, so that every command line starting with $USER1$/ runs through the wrapper.
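
With $USER1$ set that way, the wrapper receives the real plugin path and its arguments as its own arguments. A minimal sketch of such a wrapper in Python, assuming it only needs to run the plugin, pass the output through, and normalize odd exit codes (the function name wrap and that behavior are assumptions for illustration, not what the talk's wrapper does):

```python
import subprocess

UNKNOWN = 3  # Nagios exit code for "unknown"


def wrap(argv):
    """argv[0] is the wrapper itself; argv[1:] is the real plugin command line."""
    if len(argv) < 2:
        print("UNKNOWN: wrapper called without a plugin command")
        return UNKNOWN
    try:
        result = subprocess.run(argv[1:], capture_output=True, text=True)
    except OSError as error:
        print("UNKNOWN: cannot run %s: %s" % (argv[1], error))
        return UNKNOWN
    # Pass the plugin's output through unchanged.
    print(result.stdout, end="")
    # Map anything outside 0..3 to UNKNOWN.
    return result.returncode if 0 <= result.returncode <= 3 else UNKNOWN
```

A real script would end by calling sys.exit(wrap(sys.argv)); anything you want applied to every check (logging, timeouts, rewriting results) goes inside wrap.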

Custom object variables in the environment –
_web_regexp SomeRegExp
or
NAGIOS__HOSTWEB_REGEXP

Environment macros mean your plugins can know everything from the external and internal environment.
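
A minimal sketch of a plugin reading that custom variable from its environment, assuming environment macros are enabled in the Nagios configuration. The function name check_body and the idea of matching the regexp against a page body are assumptions for this example.

```python
import os
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_body(body, env=None):
    """Match the host's _web_regexp (seen as NAGIOS__HOSTWEB_REGEXP) against body."""
    env = os.environ if env is None else env
    pattern = env.get("NAGIOS__HOSTWEB_REGEXP")
    if not pattern:
        return UNKNOWN, "NAGIOS__HOSTWEB_REGEXP is not set"
    try:
        found = re.search(pattern, body)
    except re.error as error:
        return UNKNOWN, "bad regexp: %s" % error
    if found:
        return OK, "body matches %r" % pattern
    return CRITICAL, "body does not match %r" % pattern
```

The command definition stays the same for every host; only the host's custom variable changes.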

Try to avoid per-machine configs; where you do need them, make them simple. E.g. to add a new web host or db machine, you want to make the changes as small as possible. Use hostgroups, services, etc. so all you need to do is add a host definition, including setting critical and warning variable values.

Smarter plugins
Make wrappers that change based on time of day – e.g. if it’s off-hours, report good, because it doesn’t matter. You could use a timeperiod for that, or you could hard-code it into the plugin. John made a plugin that says “what storage do you have, and is it good.” So you don’t have to make a separate /data check or whatever. Or, the plugin can assume that the first observed state is “normal” and complain if it changes.
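
The "first observed state is normal" idea can be sketched like this in Python. It is only an illustration: the function name check_against_baseline, the JSON state file, and treating any change as CRITICAL are all assumptions, not details from the talk.

```python
import json
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_against_baseline(observed, state_path):
    """Record the first observation as the baseline; complain when it changes."""
    if not os.path.exists(state_path):
        # First run: whatever we see now is declared "normal".
        with open(state_path, "w") as handle:
            json.dump(observed, handle)
        return OK, "baseline recorded"
    with open(state_path) as handle:
        baseline = json.load(handle)
    if observed == baseline:
        return OK, "matches baseline"
    return CRITICAL, "changed from baseline: was %r, now %r" % (baseline, observed)
```

Here observed could be, say, a sorted list of mounted filesystems; deleting the state file resets what counts as normal.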

Principle: derived thresholds – dynamically adjust thresholds based on time of day, based on trends and past experience, based on other current state/activity.
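
A minimal sketch of a time-of-day derived threshold in Python. The threshold values, the 08:00 to 18:00 business-hours window, and the function names are all invented for this example.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def thresholds_for_hour(hour, day=(70.0, 90.0), night=(90.0, 98.0),
                        business_hours=range(8, 18)):
    """Return (warning, critical): stricter by day, looser off-hours."""
    return day if hour in business_hours else night


def classify(value, hour):
    """Classify a measurement against the thresholds derived for this hour."""
    warning, critical = thresholds_for_hour(hour)
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK
```

The same shape works for thresholds derived from trends or from other current state: only thresholds_for_hour needs to change.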

Let machines make the configurations, not you.
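
For example, a small script can emit host definitions from an inventory, so adding a machine means adding one line of data. A minimal Python sketch; the function name host_definition, the generic-host template name, and the sample host are assumptions for illustration.

```python
def host_definition(name, address, hostgroups, custom=None, template="generic-host"):
    """Render one Nagios host definition, with optional custom variables."""
    directives = [
        ("use", template),
        ("host_name", name),
        ("address", address),
        ("hostgroups", ",".join(hostgroups)),
    ]
    for key, value in sorted((custom or {}).items()):
        # Custom object variables start with an underscore.
        directives.append(("_" + key.lstrip("_").upper(), value))
    body = "\n".join("    %-16s %s" % pair for pair in directives)
    return "define host {\n%s\n}\n" % body


print(host_definition("web3", "192.0.2.13", ["webservers"], {"web_regexp": "Welcome"}))
```

Everything else (services, thresholds, contacts) then hangs off the hostgroup rather than the individual host.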

How else can we use these principles?
Define exec commands in snmpd.conf – check_snmpexec gets a table of everything that’s available, looks for the number of the description that matches (“mysql”), and uses that. Then you don’t need to know the number:

check_snmpexec host snmpcomm execname
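
The lookup-by-name step might look roughly like this in Python. This is a sketch only: the SNMP fetch is left out, the table layout (index mapped to name, result, and output) is a hypothetical stand-in for whatever the agent returns, and the function names are invented.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def find_exec_entry(table, execname):
    """Return (index, entry) for the exec entry whose name matches, else None."""
    for index, entry in sorted(table.items()):
        if entry["name"] == execname:
            return index, entry
    return None


def check_exec(table, execname):
    """Report the result of the named exec entry without knowing its number."""
    found = find_exec_entry(table, execname)
    if found is None:
        return UNKNOWN, "no exec entry named %r" % execname
    _, entry = found
    status = entry["result"] if entry["result"] in (OK, WARNING, CRITICAL) else UNKNOWN
    return status, entry["output"]
```

Because the match is on the name, reordering the exec lines on the monitored host does not break the check.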

Get Service: check_winservices – uses files to know what should be running, and gets the running services, and complains if the files don’t match.
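
The comparison at the heart of that kind of check is a set difference. A minimal Python sketch, assuming the expected and running service names have already been collected; the function name compare_services and the choice of CRITICAL for missing services versus WARNING for unexpected ones are assumptions for this example.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def compare_services(expected, running):
    """Compare the services that should run with the ones that actually do."""
    expected, running = set(expected), set(running)
    missing = sorted(expected - running)
    unexpected = sorted(running - expected)
    if missing:
        return CRITICAL, "not running: " + ", ".join(missing)
    if unexpected:
        return WARNING, "unexpected: " + ", ".join(unexpected)
    return OK, "%d expected services running" % len(expected)
```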

mbdivert – for a certain machine, route this way (e.g. hop through this ssh server)

...and more examples. Check the slides. John’s a really smart guy and knows how to use and abuse Nagios, in the good ways!