Monitoring according to roles, not category tags

For a while I've had very little to write about, because I've had very little time to write. It makes sense when you think about it. :)

We've had a lot of changes at $work recently. I've received a new position (as a team manager) while still retaining my "Perl Ninja" title, and I've also had a birthday celebrated at the $office, where I was given a makeshift rubber-band gun crafted by Tamir Lousky. :)

Another big change (the brainchild of Mr. Lousky) is monitoring according to roles rather than category tags, which is (as the title suggests) the subject of this post.

One of the most important things for a sysadmin is the ability to monitor her servers. I've created Data::Collector to help collect information about a given server. It's extremely flexible, can be run as a standalone app or as part of a larger task, and returns information in a plethora of formats (XML, JSON, YAML) which you can add to. It supports plugins for custom bits of information ("infos") that it can collect for you.

While this covered one part (fetching the information), the second part is continuous monitoring of that state. I've written before about Nagios, which we use - though I'm tempted to try Shinken when I get the chance (Jean is a very nice person, and the project is very compelling).

In the early days, we used to decide manually which tests would run on each server. That was when I worked at a different job with roughly 300 servers, but those didn't change frequently. Now I'm managing fewer servers, but they change frequently - in job or role, or they are simply terminated and replaced.

This got us trying a different route. I developed an in-house program that bridges the gap between maintaining a database and having to write a Nagios configuration file manually. It basically allows you to hook up the database, provide a configuration (in YAML, in our case), and it will create the proper Nagios configuration. It even allows some things Nagios natively does not, using some smart trickery (and no black magic).

We created many categories, each of which represented something that should run on the machine: a "webserver" category, an "mta" category, and so on. These were translated to actual Nagios service tests (using the YAML configuration file), and then a proper Nagios configuration file was created.
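As an illustration (this is a made-up sketch, not our actual schema - the category names, check commands, and layout are all invented), such a category-to-service mapping in YAML might look like:

```yaml
# Hypothetical mapping from category tags to Nagios service checks.
# The generator would emit a "define service { ... }" block for each
# entry, for every host tagged with that category.
webserver:
  - service_description: HTTP
    check_command:       check_http
  - service_description: Apache processes
    check_command:       check_procs!1:!httpd
mta:
  - service_description: SMTP
    check_command:       check_smtp
```

The appeal of this layout is that the YAML stays readable while the generated Nagios configuration can be as verbose as it needs to be.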

However, we learned the hard way that this forces us to constantly keep up with the different "jobs" a server can have, since each "job" is really just a small description of a task. If a server now needed to block SMTP, we would have to change some of its categories (making Nagios check that it does indeed block it). This proved very difficult when our "jobs" change so frequently - an issue we're in the midst of solving as well.
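For example (a sketch using the stock `negate` wrapper that ships with the Nagios plugins - this is not our actual configuration, and the host name is invented), "this server must block SMTP" can be expressed by inverting an ordinary TCP check:

```
# Hypothetical command definition: succeed only if port 25 is NOT open.
define command {
    command_name    check_smtp_blocked
    command_line    $USER1$/negate $USER1$/check_tcp -H $HOSTADDRESS$ -p 25
}

define service {
    use                  generic-service
    host_name            web01
    service_description  SMTP blocked
    check_command        check_smtp_blocked
}
```

The point is that even a "this must NOT run" requirement becomes yet another category to keep in sync by hand.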

I wrote another application (for which I wrote Algorithm::Diff::Callback) that uses Data::Collector to scan for changes on servers and update the database. The problem was that if a server's Apache went down (which happens, unfortunately), the scan would show it wasn't running and would remove it from the database - which would automatically remove the HTTP test from that particular server's test specs. Knowing this situation could happen, we never ran it automatically.
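To show the shape of that diffing step, here is a minimal, core-Perl sketch of the callback-style array diff that Algorithm::Diff::Callback offers (the real module is richer - it handles hashes as well - and the scan data here is invented):

```perl
use strict;
use warnings;

# Minimal stand-in for Algorithm::Diff::Callback's diff_arrays:
# call the 'deleted' callback for items that vanished and the
# 'added' callback for items that appeared.
sub diff_arrays {
    my ( $old, $new, %cb ) = @_;
    my %in_old = map { $_ => 1 } @{$old};
    my %in_new = map { $_ => 1 } @{$new};
    $cb{deleted}->($_) for grep { !$in_new{$_} } @{$old};
    $cb{added}->($_)   for grep { !$in_old{$_} } @{$new};
}

# Hypothetical scan results: services seen last time vs. now
my @last_scan = qw( sshd httpd mysqld );
my @this_scan = qw( sshd mysqld postfix );

my ( @removed, @added );
diff_arrays(
    \@last_scan, \@this_scan,
    deleted => sub { push @removed, shift },
    added   => sub { push @added,   shift },
);

print "removed: @removed\n";    # removed: httpd
print "added: @added\n";        # added: postfix
```

This is exactly the trap described above: a crashed Apache looks identical to a decommissioned one, so the diff alone can't tell "update the database" apart from "raise an alarm".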

At some point, Tamir suggested deciding on a role - one specific role - for each server, which would implicitly determine all the different tasks it would have and all the tests for those specific tasks. A really great idea, but one we knew we didn't have time to get done.

Recently, with my promotion, I decided to take the initiative. Once we nailed down the specs for these "roles" - plus additional "meta-roles" that attach more metadata to a specific role, allowing finer-grained tests (using Perl and Test::More) - we set about doing it.
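To sketch the idea in code (an invented example - none of the role names, meta-role names, or check names come from our actual system): a single role implies a full set of checks, and meta-roles refine it.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical role tables: one role per server, meta-roles on top.
my %role_checks = (
    webserver => [qw( check_http check_disk check_load )],
    mailer    => [qw( check_smtp check_disk check_load )],
);

my %meta_role_checks = (
    'ssl' => [qw( check_http_cert )],
);

# Expand a role plus its meta-roles into the list of checks to run.
sub checks_for {
    my ( $role, @meta_roles ) = @_;
    my @checks = @{ $role_checks{$role} || [] };
    push @checks, @{ $meta_role_checks{$_} || [] } for @meta_roles;
    return @checks;
}

my @checks = checks_for( 'webserver', 'ssl' );

is_deeply(
    \@checks,
    [qw( check_http check_disk check_load check_http_cert )],
    'webserver + ssl meta-role expands to the expected checks',
);
done_testing();
```

The win over category tags is that changing a server now means changing one field (its role) instead of keeping a handful of small "job" tags consistent by hand.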

I rewrote part of our Nagios integration code to work against these roles, and rewrote its configuration specs to align with the roles idea. Tamir went over all the servers and added the appropriate metadata ("roles" and some "meta-roles"), and yesterday we finished the operation. Today we'll be propagating different iptables policies, securing the Nagios configuration on a new encrypted server, and plugging it into our email and SMS systems.

The future is looking very bright. A post on our Perlbal solution might follow. :)


Have you considered using something like Puppet or Chef to manage the configuration of your different servers? You could easily couple the Nagios config generation to the config management that way, so that when you changed the role of a server, the new config would get pushed to the server and the Nagios config would get updated accordingly...

Just a couple of other things you might be interested in as well.

Reconnoiter as a replacement for Nagios. Not only does it scale better than Nagios, it also has trend analysis and tracking like Cacti, but smarter. Seems like a pretty neat project.

Also, I've been trying to find a better replacement for Chef or Puppet myself. Puppet seems too limiting if you stick to its DSL, and Chef requires your sysadmins to be Ruby and Merb programmers instead of just sysadmins. So if anyone started a Perl project that would be better than both of those, I'd contribute.

A couple of notes on what we do (everyone must do monitoring differently - just a fact of life!)...

We switched from Nagios to Opsview, which is a free repackaging of Nagios using a Catalyst front end and a MySQL backend, with a small XML API. It's a nice product and worth checking out.

In our home-grown inventory tool we assign network interfaces to hosts, and then assign one or more roles to the interfaces - for example ssh or https. A cron job injects host and service check config into the Opsview API from this data. Some roles map to checks emitting e-mail alerts and others don't (dev host roles).

Maybe these thoughts are useful, thanks for the good post.

About Sawyer X

Gots to do the bloggingz