Download a Mailman archive

By Kimmel on March 9, 2013 5:58 AM

Oracle is closing down the opensolaris.org site on March 24th, which is inconvenient for the rest of us. I wanted to grab the Mailman archives for the mailing lists, so I fired up a search engine and looked for existing open source projects that do this. After trying two different scripts that did not quite work, I realized it would be faster to write what I needed myself.

I started by fetching the listinfo page, which links to every archived list, and took a look at the data. Based on the page structure, the easiest method was to iterate over all the links on the page and only go deeper when a link led to a mailing list's main page. On the mailing list page I follow the link to the archives page, then scan all the links on that page for .gz files and download them. WWW::Mechanize provides a save_content() method, which handles saving the files locally with minimal effort. That was all it took.
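The crawl described above can be sketched roughly as follows. This is a minimal illustration, not the program from this post: the URL patterns assume a stock Mailman 2 layout (list pages under /listinfo/NAME, public archives under /pipermail/NAME/), and the helper names (is_list_page, is_archive_page, is_gz_file, local_name, page_links, mirror_archives) and the LIST-FILE naming scheme are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed URL shapes for a stock Mailman 2 install; adjust for the real site.
sub is_list_page    { return $_[0] =~ m{/listinfo/[^/?#]+/?$}  ? 1 : 0 }
sub is_archive_page { return $_[0] =~ m{/pipermail/[^/?#]+/?$} ? 1 : 0 }
sub is_gz_file      { return $_[0] =~ m{\.gz$}                 ? 1 : 0 }

# Build a local file name from the last two path segments: LIST-FILE.gz
# (hypothetical naming scheme, chosen so files from different lists do not collide).
sub local_name {
    my ($url) = @_;
    my ( $list, $file ) = $url =~ m{/([^/]+)/([^/]+\.gz)$} or return;
    return "$list-$file";
}

# Absolute URLs of every link on the page the agent currently holds.
sub page_links {
    my ($mech) = @_;
    return map { $_->url_abs->as_string } $mech->links;
}

sub mirror_archives {
    my ($listinfo_url) = @_;

    # Loaded at run time so the helpers above work without the CPAN module.
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Step 1: the overview page; keep only links to individual list pages.
    $mech->get($listinfo_url);
    my @list_pages = grep { is_list_page($_) } page_links($mech);

    for my $list_page (@list_pages) {
        # Step 2: on the list page, find the link to its archive index.
        $mech->get($list_page);
        my ($archive) = grep { is_archive_page($_) } page_links($mech);
        next unless defined $archive;    # no public archive link found

        # Step 3: on the archive index, download every .gz file.
        $mech->get($archive);
        for my $gz ( grep { is_gz_file($_) } page_links($mech) ) {
            my $name = local_name($gz) or next;
            $mech->get($gz);
            $mech->save_content($name);
        }
    }
    return;
}

# Only crawl when a listinfo URL is given on the command line.
mirror_archives( $ARGV[0] ) if @ARGV;
```

Saved under a name of your choosing, say mirror.pl, it would be run as `perl mirror.pl http://lists.example.org/mailman/listinfo`, where the host is a placeholder for the real listinfo URL.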

Optimization

The most time-consuming part of the whole process is fetching 8,000+ months of archives, so I made sure to cache each page and file as I went along and to use gzip compression wherever the server supports it. I achieved both just by using WWW::Mechanize::Cached::GZip with CHI as the caching object. Here is the full program in all its shortness.
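The original listing does not appear in this copy of the post. As a stand-in, here is a minimal sketch of just the caching setup described above. It is an illustration under assumptions, not the author's program: the function name cached_mech and the ./mech-cache directory are invented for the example, and it assumes WWW::Mechanize::Cached::GZip accepts the same cache parameter as its parent class WWW::Mechanize::Cached.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a caching, gzip-aware stand-in for a plain WWW::Mechanize object.
# $cache_dir is a hypothetical location for the on-disk cache.
sub cached_mech {
    my ($cache_dir) = @_;

    # Loaded at run time; both modules come from CPAN.
    require CHI;
    require WWW::Mechanize::Cached::GZip;

    # CHI's File driver keeps responses on disk, so they survive between runs.
    my $cache = CHI->new( driver => 'File', root_dir => $cache_dir );

    return WWW::Mechanize::Cached::GZip->new( cache => $cache );
}

# Fetch each URL given on the command line and report where it came from.
if (@ARGV) {
    my $mech = cached_mech('./mech-cache');
    for my $url (@ARGV) {
        $mech->get($url);
        printf "%s %s\n", ( $mech->is_cached ? 'cache' : 'fetch' ), $url;
    }
}
```

Because the returned object is a WWW::Mechanize subclass, it should drop into an existing crawl loop unchanged, and an interrupted run can be restarted without fetching pages that are already in the cache.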

Tagged as:

scraper

About Kimmel

I like writing Perl code, and since most of it is open source I might as well talk about it too. @KirkKimmel on Twitter

Kirk Kimmel