How can I make this Perl code run faster?

Hello world and Happy Holidays! This is my first time blogging on blogs.perl.org, and I figured I would take this opportunity to ask the Perl community for suggestions on how to make this Perl code run faster.

https://github.com/itcharlie/IDPcharlie/blob/master/perltools/bundle2pinto.pl

As the name of the script implies, I want to parse a CPAN autobundle file so I can generate a list of distribution files from which I can create a Pinto repository. Please note that the script is incomplete; I am mainly wondering whether there is a better approach to generating the list of distribution files.

A sample of an autobundle file can be found in the link below:

https://github.com/itcharlie/IDPcharlie/blob/master/perltools/Snapshot2013121700.pm

To run the program, pass the autobundle filename as an argument:

./bundle2pinto.pl Snapshot2013121700.pm

Below is a copy of the code:


#!/usr/bin/env perl 

# This script will parse a cpan bundle file and create a pinto repository
# with modules listed in the bundle file. 

use strict;
use warnings;
use Data::Dumper;
use LWP::Simple;
use JSON;

my $file = $ARGV[0];

open( my $fh , "<", $file ) 
        or die "Unable to open $file \n $!";


# Parse bundle file and determine distribution file url for each module version
my %modules =();
my %undef_versions = ();
my $head_cont = 0;

while ( my $line  = <$fh>) {

        if ( $line =~ /^\=head1\sCONTENTS/ ) {
                $head_cont = 1;
                next;                
        }

        next if ( $head_cont == 0 || $line =~ /^$/);
        last if ( $head_cont && $line =~ /^\=head1/ );

        $line =~ s/ +/ /g;
        my @fields =  split( ' ', $line);

        # skip functions 
        next if $fields[0] =~ /^[a-z]/;
        # skip undef module versions (string comparison, so use eq rather than ==)
        if ( $fields[1] eq "undef") {
                $undef_versions{$fields[0]} = 1;
                next;        
        }

        $modules{$fields[0]}{'VERSION'} = $fields[1];
}

my %dist_archives =();
for my $mod ( keys %modules ) {
        # Store the archive url in the hash for the modules that do have versions defined
        my $archive_url = dist_archive_url( $mod, $modules{$mod}{'VERSION'} ) ;
        next if ( ! $archive_url );
        $dist_archives{$archive_url} = 1;
}

print Dumper \%dist_archives;
#print Dumper \%undef_versions;


# Attempt to search for Module archive via cpan api.
sub dist_archive_url {

        my ($mod , $version) = @_;

        my $json = JSON->new();
        my $search_cpan = "http://search.cpan.org/api/";
        my $mod_url  = $search_cpan . "module/" . $mod;        
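        my $mod_data_json  = get( $mod_url);
        return if ! defined $mod_data_json; # request failed; caller skips this module
        my $mod_data =  $json->decode(  $mod_data_json ) ;

        my $dist = $mod_data->{'distvname'};
        return if ! defined $dist;
        $dist =~ s/-v?\d[\d._]*$//; # remove the version number, including forms like 1.2.3
        my $dist_url = "http://search.cpan.org/api/dist/" . $dist ;
        my $dist_data_json  = get( $dist_url);
        return if ! defined $dist_data_json; # request failed; caller skips this module
        my $dist_data =  $json->decode(  $dist_data_json ) ;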

        my $archive_url;        
        for my $release  ( @{$dist_data->{releases}} ) {
                if ( $release->{version} eq $version ) {
                        $archive_url =  $release->{cpanid} . "/" . $release->{'archive'};
                        last; # stop at the first matching release
                }
        }
         return $archive_url;
}



# Create a Pinto repo and pass in the ur

I would also like to know whether this is something that has been done before and, if so, how you solved the problem.

8 Comments

The underscores in the URLs are being interpreted as italics in markdown. :-(

Find the parts that are slow and make them not slow. :)

I don't say that to be flippant. It's what you're going to do throughout your programming career. I talk about benchmarking and profiling in Mastering Perl, but the quick start is to use Devel::NYTProf to see what's going on.

Consider, though, that all that network activity is likely to be a problem if you're on a slow or high-latency link. You might run some benchmarks to see how much time is spent on those requests.
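A rough way to get those numbers, before reaching for a full profiler, is to time each phase separately. The sketch below is only an illustration: `timed` is a hypothetical helper that is not part of the original script, and the two code references are stand-ins for the parsing loop and the HTTP lookups (the sleep merely simulates network latency).

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: run a code reference, report how long it took,
# and return the elapsed wall-clock seconds followed by its results.
sub timed {
    my ($label, $code) = @_;
    my $start   = Time::HiRes::time();
    my @result  = $code->();
    my $elapsed = Time::HiRes::time() - $start;
    printf STDERR "%-20s %.3f s\n", $label, $elapsed;
    return ($elapsed, @result);
}

# Stand-in for the CPU-bound part (parsing the bundle file).
my ($parse_secs, $sum) = timed('cpu-bound work', sub {
    my $total = 0;
    $total += $_ for 1 .. 100_000;
    return $total;
});

# Stand-in for the network-bound part; the sleep simulates latency.
my ($wait_secs) = timed('simulated latency', sub {
    Time::HiRes::sleep(0.2);
    return;
});
```

In the real script one would wrap the `while` loop and the `for my $mod` loop the same way. If the second figure dominates, the time is going to waiting on requests rather than to Perl itself.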

Good luck, :)

If many high-latency calls are your problem, I suggest a non-blocking script using Mojo::UserAgent.
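A minimal sketch of that idea, assuming Mojolicious is installed and recent enough to provide `get_p` and `Mojo::Promise`. The name `fetch_all_json` is hypothetical; it starts every request at once and collects the decoded JSON bodies, instead of waiting for each `get` to finish before starting the next.

```perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::Promise;

# Hypothetical helper: fetch all URLs concurrently and return a hash
# reference mapping each URL to its decoded JSON (undef on failure).
sub fetch_all_json {
    my (@urls) = @_;
    my $ua = Mojo::UserAgent->new;
    my %json_for;

    my @promises = map {
        my $url = $_;
        $ua->get_p($url)->then(
            # Success: keep the decoded body (undef if it is not JSON).
            sub { my ($tx) = @_; $json_for{$url} = $tx->result->json },
            # Connection error: record the failure and carry on.
            sub { my ($err) = @_; warn "$url: $err\n"; $json_for{$url} = undef },
        );
    } @urls;

    # Run the event loop until every request has settled.
    Mojo::Promise->all(@promises)->wait if @promises;
    return \%json_for;
}

# Example call, using the module endpoint from the script above:
#   my @urls = map { "http://search.cpan.org/api/module/$_" } keys %modules;
#   my $data = fetch_all_json(@urls);
```

Because the distribution lookup depends on the result of the module lookup, the script would need two such rounds. With hundreds of modules it is probably kinder to the server to send the URLs in batches rather than all at once.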

The first URL starts with "hhttps" instead of "https"

Please let me know how this works out for you. I'd really like to add this kind of feature to pinto directly. Accurately mapping module versions to distributions is non-trivial, because a given version can appear in many distributions. And things get more complicated if you've installed multiple versions of a distribution but they don't all have the same packages.

But using the autobundle might be good enough. I've also experimented with using the -l option of App::CPAN, which lists all installed modules and versions.

About itcharlie

I blog about Perl.