CPAN Testers Summary - January 2010 - The Wedge
One month to go and work is progressing well on the transformation to CPAN Testers 2.0. Over the last month many changes to the websites have been visible, but just as many changes have been happening behind the scenes. The Metabase is a key part of the transformation, and although work has been going well, it is reaching the point where it'll need some serious testing prior to the switch over on 1st March 2010. If you have the time, please join the cpan-testers-discuss mailing list or contact David Golden to let him know what you can help with. See David's CPAN Testers 2.0 mid-January update blog post for a more detailed status update.
After the announcement of the switch over, the CPAN Testers agreed to back off their smokers to ease the pressure on the perl.org mail servers. The cpan-testers mailing list is a very high volume list, and takes a lot of resources to manage. Many of the testers throttled back their smoke bots and we did see a dramatic reduction in test reports being submitted. We were aiming for around 5,000 a day maximum. Within a day or two we were successfully below the target.
However, not all went well. One smoker bot suddenly appeared to go AWOL, and the tester didn't seem to be responding to direct requests to throttle the smoker. Worse still, the bulk of the reports being produced were bogus. While some PASS reports got through, most were failing due to what appears to be a bad combination of environment and old toolchain software. As this was now polluting the pool of reports at a considerable rate (for every good report submitted, 1 or more was submitted by the bad smoker), something needed to be done to reduce or halt the effects. Several authors were rightly concerned that this would make their distributions look bad on the CPAN Testers Reports site. Thankfully, a new site (more on that later) is in the works which will make this easier to manage, but in the interim a further measure was put in place. I now have the ability to blacklist runaway smokers, by invalidating their reports as they come in. This means the reports are ignored by the Reports site, the Statistics site and the rest of the eco-system. I also manually marked all of that smoker's reports during January as invalid.
It transpires that the tester was on holiday and had started off his smokers before he went, without checking to ensure the reports they were sending were valid. Once this tester has upgraded to use the right tools, I'll remove him from the blacklist. However, it is good to know that we can now quickly stop any future runaway smokers before they can do much damage to the reports and statistics.
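To illustrate the idea of invalidating reports as they come in, here is a minimal sketch in Python. It is not the actual CPAN Testers code: the Report fields, the BLACKLIST set and the example addresses are all hypothetical, and the real processing tools may work quite differently.

```python
from dataclasses import dataclass

# Hypothetical blacklist of tester addresses (not real CPAN Testers data).
BLACKLIST = {"runaway@example.com"}


@dataclass
class Report:
    tester: str         # address the report was submitted from
    dist: str           # distribution under test
    grade: str          # PASS, FAIL, NA or UNKNOWN
    valid: bool = True  # reports are assumed valid until marked otherwise


def mark_incoming(report, blacklist=BLACKLIST):
    """Flag a report as invalid if its tester is on the blacklist."""
    if report.tester.lower() in blacklist:
        report.valid = False
    return report


def visible_reports(reports):
    """Return only the reports that downstream sites should count."""
    return [r for r in reports if r.valid]
```

The point of marking rather than deleting is that the reports are kept, so the flag could be cleared again once the tester has fixed their setup.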
Normally one story would be the only excitement of a month, but there was more to come. On 17th January, the CPAN Testers server started to show the effects of being under attack. In the early hours of 18th January, the server locked up, and required manual intervention to reboot it. Once back online, an investigation through the logs revealed that the MSNBot, as used by Microsoft, had been hitting the server at a rate of knots. In fact, so much so, that the logs began rapidly filling up again after the reboot. After initially blocking the range of IPs, which grew as the day went on, I wrote an article and posted it to the CPAN Testers blog to warn anyone who might be using the CPAN Testers server. Little did I realise that the story would spread like wildfire around the world on numerous IT-related news networks and blogs! I did get an apology from someone representing the Bing team, but it should never have got to that. Reading many of the comments on various blogs, although a small minority took delight in having a kick at Perl, the majority of posts were in support of the ban, and many even shared their own experiences. While I may have been the first to shout loudly, CPAN Testers definitely weren't the first to be knocked out by Microsoft.
Over a week later, with the ban still in place and the robots.txt changed to deny all access to msnbot, the msnbot still blasts the server every hour for about 5-10 minutes at a rate of 4-8 requests a second, mostly from the same 2 IP addresses. So even after banning the bot (it gets a 403) and having an apology from Microsoft, the bot still hasn't learnt to get itself under control. If Microsoft ever want people to take Bing seriously as a search engine, then they need to start acting responsibly, otherwise they are likely to find themselves banned from a good portion of the internet.
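For anyone wanting to check what a robots.txt ban like this means for a well-behaved crawler, and what the figures above add up to, here is a rough sketch using Python's standard urllib.robotparser module. The robots.txt content is an assumption about what such a ban looks like, not a copy of the actual file, and the example URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for a ban on msnbot only (not the real file).
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /
"""


def is_allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a crawler honouring robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def requests_per_burst(minutes, per_second):
    """Number of requests in one burst of the given length and rate."""
    return minutes * 60 * per_second


print(is_allowed("msnbot", "http://www.example.com/index.html"))
print(is_allowed("Googlebot", "http://www.example.com/index.html"))
print(requests_per_burst(5, 4), requests_per_burst(10, 8))
```

With these rules a compliant msnbot should fetch nothing at all, while other crawlers are unaffected. Taking the figures quoted above, each hourly burst works out at somewhere between 1,200 and 4,800 requests, every one of which is answered with a 403.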
One thing I would like to make clear about the incident is that all the monitoring of the server is done completely voluntarily. Over the last month this has taken up a lot of spare time, which often wasn't there to begin with. However, the server itself is *NOT* a High-Availability setup, and is *ONE* server on its own. There is no redundancy (apart from the RAIDed disks), with the web server, database and processing tools all sharing the same physical hardware. If it takes 2 seconds to return a web page, it's likely that the server is under considerable load processing incoming reports, running backups or generating the web pages, RSS feeds and JSON/YAML files that the rest of the eco-system (including CPAN/CPANPLUS, search.cpan.org, etc) relies on to keep up to date. Taking it out of action is not something that is taken lightly.
The original post was perhaps rather emotionally put together, and I apologise to anyone who may have got caught in any flak for that. However, I had just woken up and spent much of the morning trying to get the server back online while getting the kids ready for school and heading out to work! With it being a Monday morning too, hopefully it was understandable that a rant ensued. I'll be taking several deep breaths, if (though hopefully not) it happens again!
As mentioned earlier in the post, and in the previous summary, I did plan to release a new site during January. The CPAN Testers Administration site is still planned to go live, just not yet. With all the changes to the underlying software for CT2.0, there are some changes required for the Administration site that also need to be done. As this isn't live yet, I now consider it a lower priority than getting CT2.0 completed, and will wait until after CT2.0 has gone live before finishing off the release.
In the last weekend of January, the biggest changes to the current databases went into effect, with the Metabase GUID now being used. Although the full extent of the change won't be seen until we're using the Metabase for submitting reports, this first shift is an important one. There were a few glitches as I brought the processing tools back online, as I soon discovered little parts that were affected by the change that I hadn't anticipated. Thankfully the errors were minor and all were quickly fixed. The server is now catching up on processing from the weekend, and I anticipate all will be back to normal service within the next day or two.
To reduce the processing load, as mentioned in a previous post, the database backups are now happening a little less frequently. The CVS backups have now been disabled, with the uploads and release databases both backed up once a day (usually between 00:30 and 02:30 Central European Time). The cpanstats database is currently backed up once an hour, but seeing as the bzip version seems to be popular with only a few people (one being Yahoo! Slurp :)) and is downloaded at most once a day by any single IP (including Yahoo! Slurp ... see Microsoft, some search engines can get it right), I'm considering generating the bzip version only once a day. I'll watch the logs and see if there are any changes, but if there aren't I will likely adjust the backups in line with current requests for the files.
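The kind of log check described here can be sketched in a few lines of Python. This assumes the access log has already been reduced to (ip, day, path) tuples; the file name and addresses used are hypothetical placeholders, not values taken from the real server.

```python
from collections import Counter


def max_daily_downloads(entries, filename):
    """Highest number of downloads of a file by any one IP on any one day.

    entries is an iterable of (ip, day, path) tuples, assumed to have been
    extracted from the web server access log beforehand.
    """
    counts = Counter(
        (ip, day) for ip, day, path in entries if path.endswith(filename)
    )
    return max(counts.values(), default=0)


# Hypothetical sample data for illustration only.
sample = [
    ("192.0.2.1", "2010-01-30", "/backups/example.db.bz2"),
    ("192.0.2.2", "2010-01-30", "/backups/example.db.bz2"),
    ("192.0.2.1", "2010-01-31", "/backups/example.db.bz2"),
    ("192.0.2.1", "2010-01-31", "/index.html"),
]
print(max_daily_downloads(sample, "example.db.bz2"))
```

If the result stays at 1 over a reasonable period, generating the compressed file more than once a day is doing work that nobody is asking for.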
Along with the backup changes, various other daily server processes have been reviewed and many have been rescheduled to reduce server load. The end result has been to reduce the nightly overheads and hopefully the server will be in a better position to process reports once the CPAN Testers switch to the Metabase and unleash their smokers from the current limiters.
Last month we had a total of 162 tester addresses submitting reports. The mappings this month included 21 total addresses mapped, of which 7 were for newly identified testers. Another low mapping month, due to work being done on CPAN Testers as a whole.
A long summary this month, but then a lot has been happening. Expect updates throughout the month as various parts of CT2.0 undergo testing, and we start to see the results of all the hard work of the past couple of years. The future is nigh.
Cross-posted from the CPAN Testers Blog.