Curating CPAN sometimes means really deleting stuff
Regularly, CPAN authors are reminded that CPAN is a large collection of files, mirrored all over the world, and that it would be nice to keep its size reasonable.
Over the last twelve years, we've seen regular calls to remove old versions of distributions from CPAN. Because I'm too lazy to look for more, I'll point to the earliest and the most recent I could find.
My topic for today is not just removing old releases, but actually removing all the versions of a published distribution from CPAN (thus making it disappear entirely, except from BackPAN).
Here are some good reasons to do this:
- the module was an experiment and the experiment failed
- the module serves no purpose any more
- the module is interfacing with an online service that has disappeared or radically changed (i.e. the other half of the equation has disappeared entirely -- this is different from interfacing with an obsolete library: old code never dies)
- it will live forever on BackPAN -- in suspended animation
Having lost interest or not being able to maintain the code any more are not valid reasons to remove a distribution from CPAN, but they are good reasons to make ADOPTME a maintainer.
So, if you've got an old distribution that doesn't belong on CPAN any more, maybe you could consider removing it on CPAN Day?
Below is the story of the three times I have removed a distribution from CPAN entirely.
At the beginning of the millennium, I got involved in Perl Golf (see ASAVIGE's The Lighter Side of Perl Culture (Part IV): Golf for more than you ever want to know about Perl Golf), and in 2002, I worked with JQUELIN and others on a module that was supposed to help run and test golf entries. After a few months, it became clear that our project was too ambitious, that we lacked the time and motivation to make it work properly, and more importantly that another program was already widely used to perform the same role.
When that became really obvious, I announced that the namespace was available. Some time later, I removed all the releases from CPAN, and dropped all the permissions I had on the namespace.
While writing this post, I dug up the old CVS repository, so I might convert it to Git and put it on GitHub at some point. (Because old code should not die, but instead be placed in suspended animation for youngsters to point and laugh.)
What do you do when a web site is useful but its UI is terrible? You scrape it and perform the interesting moves in your code! There is a website where you can buy and sell used (and new) goods (books at first, but now just about anything). The difference with eBay is that there are no bids. The sellers set their prices, the site takes its cut, and the buyer pays some extra for postage.
The big drawback is the extra postage costs. However, you can reduce that by buying several items from the same seller. How hard would it be to take your wishlist and point out when several items on it are sold by the same user? (I'm pretty sure it would increase conversion.) Mind you, the SQL query at the heart of this feature probably wouldn't need more than two JOINs.
Because at some point we had a large wishlist and we wanted to save some of the postage costs (and because it was fun), I wrote a scraper that would find all the sellers for all the goods in our wishlist, and tell me when someone was selling more than one item we wanted. Actually, the library code was more about fetching each type of page (item, seller, wishlist) and handing the data to the user. I then used that data to find sellers that had more than one item on our wishlist.
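The "sellers with more than one wishlist item" check can be sketched as a single query with two JOINs. The snippet below is only an illustration, using Python and an in-memory SQLite database. The table and column names (`sellers`, `offers`, `wishlist`) and the sample rows are invented for the example. They are not the site's schema, and this is not the code of the removed module.

```python
import sqlite3

def sellers_with_multiple_wishlist_items(conn):
    """Return (seller_name, item_count) for sellers offering 2+ wishlist items."""
    query = """
        SELECT s.name, COUNT(DISTINCT o.item_id) AS n_items
        FROM wishlist AS w
        JOIN offers  AS o ON o.item_id = w.item_id   -- first JOIN
        JOIN sellers AS s ON s.id = o.seller_id      -- second JOIN
        GROUP BY s.id, s.name
        HAVING COUNT(DISTINCT o.item_id) > 1
        ORDER BY n_items DESC, s.name
    """
    return conn.execute(query).fetchall()

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sellers  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE offers   (seller_id INTEGER, item_id INTEGER, price REAL);
    CREATE TABLE wishlist (item_id INTEGER);
    INSERT INTO sellers  VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO offers   VALUES (1, 10, 5.0), (1, 11, 7.5), (2, 10, 4.0),
                                (3, 11, 6.0), (3, 12, 3.0), (3, 13, 9.0);
    INSERT INTO wishlist VALUES (10), (11), (12);
""")

for name, n_items in sellers_with_multiple_wishlist_items(conn):
    print(f"{name} sells {n_items} items from the wishlist")
```

Item 13 is on offer but not on the wishlist, so it does not count towards its seller's total.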
Soon after I released version 0.01 of the module on CPAN, one of their developers contacted me on IRC (we had chatted on IRC in the past, and he remembered me), and told me that the company was so angry about my scraping code that they were on the verge of unleashing their lawyers on me... It seems their great issue at the time was other companies scraping their site (book notices they paid for) and they didn't want me to help those with my open source code. So I removed the modules from CPAN (I even contacted the PAUSE admins to speed up the deletion), and sent an email to the company explaining the steps I had taken to remove the code. BackPAN is obscure enough that you can only find what you know is already there.
No lawyer came after me. Before you ask, I had no interest in trying my luck at fighting, when I was probably violating the terms of service anyway. The crappy website UI remains. People still spend more than they need to. We don't buy as much stuff there as we used to.
Heavens-Above is a website "dedicated to helping people observe and track satellites orbiting the Earth without the need for optical equipment such as binoculars or telescopes" (says its Wikipedia entry).
Back in 2002, it was using its own geodatabase to help people set their viewing location. One of my friends was looking for the list of all city names in the world (in France, at first), and pointed me at the site. So I wrote a scraper script for getting the data out of the search box, which I later turned into a module.
The one interesting issue I remember from scraping the site is that the search results were cut at 200 answers. So to get all the cities in a country, I'd start with the * request, detect if there were more than 200 answers, take the last answer, and update the search query. So queries would be *, a*, aa* (think cities in the Netherlands), etc. When there were fewer than 200 answers, it was time to backtrack, and so move from aa* to ab*. And so on, until the whole list of cities had been exhausted. Looking at the last of the 200 answers allowed some cuts in the search, e.g. if the last result of the al* search for Afghanistan was "Aliabad", then the next query could be ali* (skipping all the queries between ala* and alh*, which had either results already obtained from the al* query or no result at all).
I did a lightning talk about it at YAPC::Europe in 2002, titled How I captured thousands of Afghan cities in a few hours, where I explained the above as a "possible optimisation". It was added a few weeks later.
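For the curious, here is a minimal sketch of that prefix walk, including the "last answer" shortcut. It is written in Python for illustration and is not the module's code. The `search` callable is hypothetical: it stands in for the site's search box and is assumed to return, in alphabetical order, at most `limit` names starting with the given prefix. The sketch also assumes names contain only the lowercase letters a to z, and it treats a full page of results as possibly truncated.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: names only contain a-z
LIMIT = 200                        # result cap mentioned in the post

def crawl(search, prefix="", limit=LIMIT, alphabet=ALPHABET):
    """Collect every name starting with `prefix` using capped prefix searches."""
    results = search(prefix)
    found = set(results)
    if len(results) < limit:
        return found  # not truncated: nothing more to find, backtrack
    # Truncated: every name sorting before the last answer is already known,
    # so child prefixes that sort before it can be skipped.
    last = results[-1]
    depth = len(prefix)
    start = alphabet.index(last[depth]) if len(last) > depth else 0
    for char in alphabet[start:]:
        found |= crawl(search, prefix + char, limit, alphabet)
    return found

def make_fake_search(names, limit=LIMIT):
    """Build a stand-in for the site's search, recording the prefixes queried."""
    ordered = sorted(names)
    calls = []
    def search(prefix):
        calls.append(prefix)
        return [n for n in ordered if n.startswith(prefix)][:limit]
    return search, calls

# Tiny demonstration with a cap of 3 instead of 200.
demo_names = ["aa", "ab", "abc", "abd", "b", "ca"]
demo_search, demo_calls = make_fake_search(demo_names, limit=3)
print(sorted(crawl(demo_search, limit=3)))
print("aa" in demo_calls)  # the "aa" prefix is never queried
```

In the demonstration, the search for "a" is cut at "abc", so the walk resumes at "ab" and never queries "aa". This is the same cut as jumping from al* straight to ali*.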
The location search form changed in early May 2011 (I fixed the module in December 2013). On January 24, 2014, it finally switched to using Google for finding the latitude and longitude of any city in the world. (I found the dates the site changed its interface thanks to the wonderful and little known CPAN Analysis site, and these two reports.)
With the gazetteer service of the site gone, my module became instantly obsolete. I removed it from CPAN in July, after finding out the reason why the tests kept failing.