Skipping large files when mirroring your mini CPAN

My Internet connection at home is not great: it's rather slow and flaky. When I ran minicpan to update my mini CPAN earlier today, the process kept choking on one file, id/D/DG/DGINEV/Lingua-EN-SENNA-0.03.tar.gz, and exiting with a connection time-out error. After the first run I thought it must be my connection and simply re-ran the script. When the second and third runs got stuck on the same file, I got curious. Lo and behold, the file is 185MB!

I then typed this to peek into my CPAN mirror:

% cd /cpan
% find . -type f -size +10M
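
To see how big each match actually is, a variant along these lines could be used. It is only a sketch: the function name list_large_files is made up for illustration, and the M size suffix assumes a find that supports it (GNU find does).

```shell
# list_large_files DIR [MIN_SIZE]   (hypothetical helper name)
# Print "size-in-KB<TAB>path" for every file above MIN_SIZE, largest first.
list_large_files() {
    dir=${1:-.}      # directory to scan, defaults to the current one
    min=${2:-10M}    # size threshold in find's -size syntax
    find "$dir" -type f -size +"$min" -exec du -k {} + | sort -rn
}

# Example: list_large_files /cpan 10M
```

Note that du reports disk usage rather than the exact byte count, so the numbers are approximate but good enough for spotting the worst offenders.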

It turns out there are quite a few files over 10MB in size. Since I (currently) don't need any of the distributions listed, I entered:

% find . -type f -size +10M -exec rm {} \;

Not too bad: I shaved off around 1GB by doing this (from about 4210MB to 3119MB). This is on a minuscule SSD, so more free space is always welcome.
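
The space a cleanup like this would free can also be estimated before deleting anything. The following is a rough sketch, not what I ran: total_large_bytes is a made-up name, and -printf is a GNU find extension.

```shell
# total_large_bytes DIR [MIN_SIZE]   (hypothetical helper name)
# Print the combined size, in bytes, of all files above MIN_SIZE.
total_large_bytes() {
    dir=${1:-.}      # directory to scan, defaults to the current one
    min=${2:-10M}    # size threshold in find's -size syntax
    # %s is the file size in bytes (GNU find only); awk adds the sizes up.
    find "$dir" -type f -size +"$min" -printf '%s\n' |
        awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Example: total_large_bytes /cpan 10M
```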

To prevent the large files from being mirrored again on subsequent updates, I wrote a patch module, LWP::UserAgent::Patch::FilterMirrorMaxSize. This patch installs a wrapper around LWP::UserAgent's mirror() method. The wrapper first checks the local file: if it already exists and its size is above the specified limit, the wrapper returns a 304 (Not Modified) response without downloading anything. Otherwise, it performs a HEAD request on the remote URL. If the Content-Length header's value is above the limit, it again returns a 304 response. Otherwise, it mirrors as usual. The extra HEAD request before the GET for each file makes the mirror process slower, but all large files are safely skipped.
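
The order of the checks can be illustrated with a small shell sketch. This is not the module's code (the module is Perl and wraps mirror() directly); the function names are made up, and curl stands in for the HEAD request.

```shell
# Hypothetical helpers that mimic the decision order described above.

# local_size FILE: print the file's size in bytes, or nothing if it is missing.
local_size() {
    if [ -f "$1" ]; then wc -c < "$1" | tr -d ' '; fi
}

# remote_size URL: print the Content-Length reported by a HEAD request,
# or nothing if the header is absent. Assumes curl is installed.
remote_size() {
    curl -sIL "$1" | tr -d '\r' |
        awk -F': *' 'tolower($1) == "content-length" { len = $2 }
                     END { if (len != "") print len }'
}

# over_limit SIZE LIMIT: succeed only if SIZE is known and above LIMIT.
over_limit() {
    [ -n "$1" ] && [ "$1" -gt "$2" ]
}

# mirror_decision FILE URL LIMIT: print skip-local, skip-remote, or mirror.
mirror_decision() {
    file=$1 url=$2 limit=$3
    # 1. A large local copy means we skip without touching the network.
    if over_limit "$(local_size "$file")" "$limit"; then
        echo skip-local; return
    fi
    # 2. Otherwise ask the server how big the file is (the extra HEAD).
    if over_limit "$(remote_size "$url")" "$limit"; then
        echo skip-remote; return
    fi
    # 3. Small enough, or size unknown: mirror as usual.
    echo mirror
}
```

In this sketch a missing Content-Length falls through to "mirror"; whether the module treats that case the same way is an assumption on my part.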

To use it:

% PERL5OPT="-MLWP::UserAgent::Patch::FilterMirrorMaxSize=-size,10485760,-verbose,1" minicpan -l /cpan -r http://mirrors.kernel.org/cpan/

By the way, creating a patch module is a quick and easy way to add or modify functionality in another module, and to package and distribute that modification. If you create a subclass instead, you'll need to change all the code that uses the original class. If you send a patch or pull request to the original distribution, it can take a long time to get merged (and that is if the author sees the patch as merge-worthy at all).

About Steven Haryanto

A programmer (mostly Perl 5 nowadays). My CPAN ID: SHARYANTO. I'm sedusedan on perlmonks. My twitter is stevenharyanto (but I don't tweet much). Follow me on github: sharyanto.