Backing up Berlios.de
Last year it was announced that www.berlios.de was going to be shut down. People were asking whether someone would back it up to save all those open source projects. I decided to give it a shot and was able to back up all of the BerliOS projects. While I was working on uploading the data to a new host (I was looking at GitHub), it was announced that the site had been saved, so I set the project aside.
Digging around, I found this code and decided to post it so that people building data-mining-style tools can have another real-world example. github.com/kimmel/backup-berlios.de contains two scripts, a shared library, and a data file.
01_fetch_project_list.pl builds a list of all the projects on Berlios and writes it to a file.
02_download_repos.pl takes that data file and downloads everything it can.
I broke the process into two scripts for a few reasons, mainly so I could resume the downloads: with the project list premapped, the downloader can simply skip anything that already exists.
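The resume logic can be sketched roughly like this; the file names, directory layout, and archive naming here are illustrative, not the ones the actual scripts use:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Returns true when the archive already exists locally, so a rerun
# of the downloader can resume past work finished on a previous run.
sub already_fetched {
    my ($path) = @_;
    return -e $path ? 1 : 0;
}

my @projects = qw(foo bar);   # normally read from the premapped list file
my $dest_dir = '.';           # download directory (illustrative)

for my $project (@projects) {
    my $target = "$dest_dir/$project.tar.gz";
    next if already_fetched($target);   # skip anything that already exists
    print "would download $project -> $target\n";
    # ... the actual fetch would go here ...
}
```

Because the expensive state (the downloaded archives) lives on disk, the script itself can stay stateless and just be rerun after any interruption.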
When downloading a large number of web pages, two key optimizations can be made: caching and compression. WWW::Mechanize::Cached::GZip works out great because it requests gzip-compressed responses, which is fantastic for web pages, and automatically caches the results for later. For the projects themselves I was simply fetching archive files, so WWW::Mechanize::* is overkill in terms of features; LWP::UserAgent was perfect for that simple task.
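A minimal sketch of the two fetch strategies, assuming both CPAN modules are installed; the URLs are placeholders, not the exact pages the scripts hit:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize::Cached::GZip;   # gzip-compressed, cached page fetches
use LWP::UserAgent;                 # plain fetches for archive files

# HTML pages: gzip saves bandwidth, and the cache avoids re-fetching
# the same page when the script is rerun.
my $mech = WWW::Mechanize::Cached::GZip->new( autocheck => 0 );
$mech->get('http://developer.berlios.de/softwaremap/');
print "page bytes: ", length( $mech->content ), "\n" if $mech->success;

# Archive files: no scraping features needed, so a bare user agent is
# enough. mirror() skips the download when the local copy is up to date.
my $ua  = LWP::UserAgent->new( timeout => 30 );
my $res = $ua->mirror( 'http://example.org/project.tar.gz', 'project.tar.gz' );
print "mirror status: ", $res->status_line, "\n";
```

The split mirrors the point above: a heavyweight scraping agent where caching pays off, a bare user agent where it would only add overhead.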
I didn't need to worry about Unicode, since BerliOS uses only ASCII characters in project names. For logging, the filename needed to contain the date (YYYY-MM-DD) and rotate automatically; a combination of Log::Dispatch::File::Stamped and Log::Dispatch::Screen fulfilled these requirements. I also considered doing the work in parallel but decided against it: I prefer to collect data as a good web citizen rather than constantly slamming a server with as many requests as possible.
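The logging setup can be sketched as follows, assuming Log::Dispatch and Log::Dispatch::File::Stamped from CPAN; the log levels and base filename are illustrative choices, not necessarily what the scripts use:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Log::Dispatch;

my $log = Log::Dispatch->new(
    outputs => [
        [ 'File::Stamped',              # Log::Dispatch::File::Stamped
          min_level => 'info',
          filename  => 'backup.log',    # becomes e.g. backup-2012-05-01.log
          stamp_fmt => '%Y-%m-%d',      # strftime format for the date stamp
          newline   => 1,
        ],
        [ 'Screen',                     # Log::Dispatch::Screen
          min_level => 'notice',        # terminal shows only the louder messages
          newline   => 1,
        ],
    ],
);

$log->info('fetched project list');     # file only
$log->notice('download run starting');  # file and screen
```

Since the date is part of the filename, rotation happens for free: each day's run simply opens a new file, with no external logrotate step needed.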