Monitoring according to roles, not category tags
For a while I've had very little to write because I've had very little time to speak. It makes sense when you think of it. :)
We had a lot of changes at $work recently. I've received a new position (as a team manager) but I'm still retaining my "Perl Ninja" title, and I've also had a birthday celebrated at the $office and was given a makeshift rubberband gun, crafted by Tamir Lousky. :)
Another big change (that is the brainchild of Mr. Lousky) is monitoring according to roles and not category tags, which is (at least according to the title) the subject of this post.
One of the most important things to a sysadmin is the ability to monitor her servers. I've created Data::Collector to help collect information about a given server. It's extremely flexible: it can be run as a separate app or as part of a given task, and it returns information in a plethora of formats (XML, JSON, YAML), to which you can add your own. It supports plugins for custom information bits ("infos") that it can collect for you.
While this covers one part (the fetching of information), the second part is continuous monitoring of that state. I've written before about Nagios, which we use - though I'm tempted to try Shinken when I get the chance (Jean is a very nice person, and the project is very compelling).
In the early days, we used to decide manually which tests would run on each server. That was when I worked at a different job, which had roughly 300 servers that didn't change frequently. Now I'm managing fewer servers, but they change frequently: either their job or role changes, or they are simply terminated and replaced.
This got us trying a different route. I developed an in-house program that bridges the gap between creating a Nagios configuration file manually and maintaining a database. It basically allows you to hook up the database and present a configuration (in YAML, in our case), and it will create the proper Nagios configuration. It even allows some things Nagios natively does not, using some smart trickery (and not black magic).
We created many categories, each of which represented something that should run on the machine. It could be a "webserver" category, an "mta" category and so on. These would be translated to actual Nagios service tests (using the YAML configuration file) and then a proper Nagios configuration file was created.
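To make the category approach concrete, here is a minimal sketch in Python (not our actual tool). The category names come from the examples above; the hostnames, the dictionary layout standing in for the parsed YAML and the database, and the choice of the `check_http` and `check_smtp` commands are all assumptions made for illustration.

```python
# Hypothetical stand-in for the parsed YAML configuration:
# each category tag maps to the Nagios checks it implies.
CATEGORY_CHECKS = {
    "webserver": [("HTTP", "check_http")],
    "mta": [("SMTP", "check_smtp")],
}

# Hypothetical stand-in for the database: host -> list of category tags.
HOST_CATEGORIES = {
    "web01": ["webserver"],
    "mail01": ["mta", "webserver"],
}


def render_service(host, description, command):
    """Render one Nagios service object definition as text."""
    return (
        "define service {\n"
        f"    host_name            {host}\n"
        f"    service_description  {description}\n"
        f"    check_command        {command}\n"
        "}\n"
    )


def render_config(host_categories, category_checks):
    """Expand every category tag on every host into service definitions."""
    blocks = []
    for host in sorted(host_categories):
        for category in host_categories[host]:
            for description, command in category_checks[category]:
                blocks.append(render_service(host, description, command))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_config(HOST_CATEGORIES, CATEGORY_CHECKS))
```

Every tag on every host expands independently, which is why each small change to what a server does means touching its tags.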
However, we've learned the hard way that this forces us to constantly keep up with the different "jobs" a server can have, since any "job" is a really small description of a task. If a server now needed to block SMTP, we would have to change some of the categories (making Nagios check that it does indeed block it). This proved to be very difficult when our "jobs" change so frequently - an issue we're in the midst of solving as well.
I wrote another application (which I wrote Algorithm::Diff::Callback for) that uses Data::Collector to scan for changes on servers and update the database. The problem was that if a server had Apache go down (which happens, unfortunately), the scan would show that it isn't running and would remove it from the database. This would automatically cause the HTTP test to be removed from the test specs of that particular server. Knowing this situation could happen, we never ran this automatically.
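The failure mode is easy to reproduce in a few lines. This is a rough Python sketch of the idea only: it does not use or mirror the Algorithm::Diff::Callback API, and the function names and service names are hypothetical.

```python
def diff_with_callbacks(old, new, on_added, on_removed):
    """Call on_added / on_removed for items that differ between two scans."""
    for item in sorted(set(new) - set(old)):
        on_added(item)
    for item in sorted(set(old) - set(new)):
        on_removed(item)


def apply_scan(db_services, scanned_services):
    """Blindly sync the stored service list with what the scan just saw."""
    updated = set(db_services)
    diff_with_callbacks(
        db_services,
        scanned_services,
        on_added=updated.add,
        on_removed=updated.discard,
    )
    return updated


if __name__ == "__main__":
    stored = {"apache", "sshd"}
    scanned = {"sshd"}  # Apache has crashed, so the scan does not see it
    # "apache" is dropped, and its HTTP test would be dropped along with it.
    print(apply_scan(stored, scanned))
```

The scan cannot tell "Apache was deliberately removed" from "Apache is down right now", so the one moment you most want the HTTP test is the moment it would disappear.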
At some point, Tamir suggested deciding on a role - one specific role - for each server, that would implicitly determine all the different tasks it would have and all the tests for these specific tasks. A really great idea, but one we knew we didn't have time to get done.
Recently, with my promotion, I decided to take the initiative. Once we had settled the specs for these "roles", along with additional "meta-roles" that attach more metadata to specific roles to allow finer-grained tests (using Perl and Test::More), we set about doing this.
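As a rough illustration of the roles idea, here is a Python sketch. It is not our real spec: the role names, meta-role names, hostnames and check commands are invented for the example, and it simplifies meta-roles into extra checks rather than the Perl and Test::More tests mentioned above.

```python
# Hypothetical role spec: one role implies the full set of checks.
ROLE_CHECKS = {
    "web-frontend": ["check_http", "check_ssh"],
    "mail-relay": ["check_smtp", "check_ssh"],
}

# Hypothetical meta-roles: extra metadata that adds finer-grained checks.
META_ROLE_CHECKS = {
    "production": ["check_load"],
}

# Hypothetical inventory: exactly one role per server, plus optional meta-roles.
HOSTS = {
    "web01": {"role": "web-frontend", "meta_roles": ["production"]},
    "mail01": {"role": "mail-relay", "meta_roles": []},
}


def checks_for(host_entry, role_checks, meta_role_checks):
    """Derive a host's checks from its declared role and meta-roles."""
    checks = list(role_checks[host_entry["role"]])
    for meta in host_entry["meta_roles"]:
        for check in meta_role_checks[meta]:
            if check not in checks:  # avoid duplicate checks
                checks.append(check)
    return checks


if __name__ == "__main__":
    for name in sorted(HOSTS):
        print(name, checks_for(HOSTS[name], ROLE_CHECKS, META_ROLE_CHECKS))
```

In this sketch the checks follow from what a server is declared to be, not from what a scan happens to find running, so a crashed Apache would not cost the server its HTTP check.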
I rewrote part of our Nagios integration code to work against these roles, and rewrote its configuration specs to align with the roles idea. Tamir went over all the servers and added the appropriate metadata ("roles" and some "meta-roles"), and yesterday we finished the operation. Today we'll be propagating different iptables policies, securing the Nagios configuration on a new secure encrypted server, and plugging it into our email and SMS system.
The future is looking very bright. A post on our Perlbal solution might follow. :)