CPAN Testers Summary - May 2010 - Relayer
Further beta testing has been ongoing over the last week and all seems to be working well. The metabase search problem, although not completely resolved, now has a mechanism in place to avoid blockages. The feed started up again last week and, after 3 days, had caught up with a month's worth of reports submitted to the metabase. Although this is still in the beta phase, it does give us more confidence that the eco-system can cope with the submission levels we have been seeing in the last year, and has the capacity to cope with much more. Further tests are being carried out, but the days of CT1.0 are definitely numbered now.
Another big news item for CPAN Testers last month was, once again, Microsoft's aggressive behaviour towards us. A number of commenters on various news threads take the view that the company are unlikely to have specifically targeted us. However, it should be noted that the project lead contacted me this time around, and stated that the logs and suggestions I sent back in January, at their request, had been used to modify the bot to be less aggressive towards sites like ours. So we weren't exactly unknown to them. This time around, I've again sent logs and suggestions, but have so far not received a word of response. A pity really, as a thank you for the data, at the very least, might have helped with damage limitation.
One thing that I have discovered in all this is that order is significant within Apache configuration files. Previously, the rule blocking the Microsoft IPs was also blocking access to the robots.txt file, which I have since fixed. I recently spoke with Alex Chudnovsky, Managing Director of Majestic 12 (their bot, like many others, crawls the site while respecting our bandwidth), regarding this incident. He explained that the specification for handling robots.txt states that robots should treat a 403 response for the robots.txt file itself as if the file contained a rule to disallow the bot at the root level. In fact, quoting the specification directly:
"Specific behaviors for other server responses are not required by this specification, though the following behaviours are recommended:
- On server response indicating access restrictions (HTTP Status Code 401 or 403) a robot should regard access to the site completely restricted."
As the msnbot had been receiving a 403 response for 4 months, there really is no excuse. As it is, the msnbot is still requesting robots.txt several hundred times a day, which is still overkill and far too aggressive. Until I see much less aggression, they will not be unblocked.
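To make the recommendation quoted above concrete, here is a minimal sketch of how a well-behaved crawler might map the HTTP status of its robots.txt request onto a crawl policy. The names `RobotsPolicy` and `policy_for_robots_status` are hypothetical, made up for illustration, and only the 401/403 branch comes from the passage quoted above. The handling of the other status codes is my own assumption about sensible behaviour, not something taken from this post.

```python
from enum import Enum


class RobotsPolicy(Enum):
    """Hypothetical crawl policies a bot could adopt for a site."""
    PARSE_RULES = "parse the returned robots.txt and obey its rules"
    DISALLOW_ALL = "treat the whole site as off limits"
    ALLOW_ALL = "no robots.txt restrictions apply"
    RETRY_LATER = "defer crawling and ask again later"


def policy_for_robots_status(status: int) -> RobotsPolicy:
    """Choose a crawl policy from the status code of a robots.txt request."""
    # From the quoted recommendation: 401 or 403 means access to the
    # site should be regarded as completely restricted.
    if status in (401, 403):
        return RobotsPolicy.DISALLOW_ALL
    # Assumption: a successful response carries rules that must be parsed.
    if 200 <= status < 300:
        return RobotsPolicy.PARSE_RULES
    # Assumption: a missing file means the site publishes no restrictions.
    if status == 404:
        return RobotsPolicy.ALLOW_ALL
    # Assumption: anything else is treated as temporary, so back off.
    return RobotsPolicy.RETRY_LATER


if __name__ == "__main__":
    for code in (200, 403, 404, 503):
        print(code, policy_for_robots_status(code).value)
```

Under this sketch, a bot that receives a 403 for robots.txt stops crawling the site altogether, rather than coming back for the file several hundred times a day. For what it's worth, Python's standard `urllib.robotparser` module applies the same rule to 401 and 403 responses.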
Two further updates to the Statistics website were unveiled last month. The first added some CPAN Milestones, covering the submissions to CPAN, to the Statistics of CPAN page. Brian Cassidy caught me on IRC and was the first to ask who was the first to submit, and although I suspected the answer, I was pleased that Andreas was able to give a nice insight into how CPAN started. It is quite something to think that in the last 15 years, 35 distributions have grown into over 20,000 distributions, many being used daily around the world.
The second update relates to those authors who have email addresses that are now out of date, possibly through moving to new jobs or new personal domains. While the Daily Summaries can easily be switched off to avoid sending emails that can bounce back, it is not so easy for users who want to get in touch with an author. It was for the latter reason that the Missing In Action page has been added. There are only a small number of bouncing addresses, but hopefully this will alert and encourage authors to keep their PAUSE accounts up to date.
I'm not recording new testers and mappings currently, as the mechanism to do this is changing, and will feed into the new Administration site that now needs to use the Metabase as well as the older mechanism. Originally this site was going to be launched in January, but was put on hold as the CT2.0 work became more urgent. Work will begin again on that very soon.
So that's all the news this month. Hopefully we will have more definite news regarding the launch of CT2.0 this time next month. In the meantime, happy testing :)
Cross-posted from the CPAN Testers Blog.