Web Scraping with Modern Perl (Part 2 - Speed Edition)

tl;dr

Grab the gist with the complete, working source code. Benchmark it against the one featured in the previous article:

$ \time perl mojo-crawler.pl
23.08user 0.07system 2:11.99elapsed 17%CPU (0avgtext+0avgdata 84064maxresident)k
16inputs+0outputs (0major+6356minor)pagefaults 0swaps

$ \time perl yada-crawler.pl
8.83user 0.30system 0:12.60elapsed 72%CPU (0avgtext+0avgdata 131008maxresident)k
0inputs+0outputs (0major+8607minor)pagefaults 0swaps

How can it be 10x faster while consuming less than half of the CPU time?!
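For the skeptical, here is a minimal sketch that redoes the arithmetic from the two time reports above. The numbers are copied from those outputs; the helper name elapsed_seconds and the hash layout are made up for this illustration.

```perl
use strict;
use warnings;

# Convert GNU time's elapsed field ("m:ss.ff" or "h:mm:ss") to seconds.
sub elapsed_seconds {
    my ($field) = @_;
    my $seconds = 0;
    $seconds = $seconds * 60 + $_ for split /:/, $field;
    return $seconds;
}

# Figures copied from the two benchmark runs above.
my %run = (
    mojo => { user => 23.08, system => 0.07, elapsed => '2:11.99' },
    yada => { user => 8.83,  system => 0.30, elapsed => '0:12.60' },
);

for my $r (values %run) {
    $r->{cpu}  = $r->{user} + $r->{system};          # total CPU seconds
    $r->{wall} = elapsed_seconds($r->{elapsed});     # wall-clock seconds
}

my $speedup   = $run{mojo}{wall} / $run{yada}{wall};
my $cpu_ratio = $run{yada}{cpu}  / $run{mojo}{cpu};

printf "wall-clock speedup: %.1fx\n", $speedup;
printf "CPU time ratio:     %.2f\n",  $cpu_ratio;
```

The wall-clock speedup comes out a little above 10x, and the CPU time ratio lands at roughly 0.4. Note that the %CPU column goes up (17% to 72%) simply because the same kind of work is squeezed into a much shorter wall-clock window.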

Perl as glue

Sorry, I cheated a bit on the mojo-crawler.pl benchmark results. It implicitly uses EV, a high-performance, full-featured event loop library, when it is present. EV is not required for Mojolicious to work properly, so let's disable it:

$ MOJO_REACTOR=Mojo::Reactor::Poll time perl mojo-crawler.pl
113.99user 13.37system 2:08.46elapsed 99%CPU (0avgtext+0avgdata 83808maxresident)k
2912inputs+0outputs (18major+5789minor)pagefaults 0swaps

The elapsed time is the same with or without EV, but now the pure-Perl crawler hogs the CPU!
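If you are not sure which reactor your own setup ends up with, a small check like the one below may help. It is only a sketch: it relies on the detect class method of Mojo::Reactor, which honors the MOJO_REACTOR environment variable, and it falls back to a message when Mojolicious is not installed.

```perl
use strict;
use warnings;

# Ask Mojolicious which reactor class it would pick in this environment.
# The eval keeps the script usable on machines without Mojolicious.
my $reactor = eval {
    require Mojo::Reactor;
    Mojo::Reactor->detect;
};

print defined $reactor
    ? "Mojolicious would use: $reactor\n"
    : "Mojolicious does not seem to be installed\n";
```

Running it once as is and once with MOJO_REACTOR=Mojo::Reactor::Poll in the environment should show the switch, provided EV is installed.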

Why? EV provides an interface to libev, which clearly polls connections far more efficiently than 100% interpreted code. The bridge between Perl and the compiled library is called XS:

XS is an interface description file format used to create an extension interface between Perl and C code (or a C library) which one wishes to use with Perl.
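To get a feel for the gap between XS and interpreted code without installing anything, here is a rough analogy using only core modules. It compares List::Util's sum, which is implemented in C via XS in standard Perl builds, against a hand-written loop. The function name sum_pure_perl and the list size are arbitrary choices for this sketch, and it says nothing about EV itself.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(sum);

my @numbers = (1 .. 10_000);

# The same job done by the interpreter, one opcode at a time.
sub sum_pure_perl {
    my $total = 0;
    $total += $_ for @_;
    return $total;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    xs   => sub { my $t = sum(@numbers) },
    perl => sub { my $t = sum_pure_perl(@numbers) },
});
```

The exact rates depend on the machine and the Perl build, so no numbers are quoted here; the point is only the relative difference in the table.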

Actually, CPAN is full of high-performance XS-based modules for many tasks.

Thus, an efficient and fast web crawler/scraper could be constructed with those "bare-metal" building blocks ;)

Show me the code!

#!/usr/bin/env perl
use 5.016;
use common::sense;
use utf8::all;

# Use fast binary libraries
use EV;
use Web::Scraper::LibXML;
use YADA 0.039;

YADA->new(
    common_opts => {
        # Available opts @ http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
        encoding        => '',
        followlocation  => 1,
        maxredirs       => 5,
    }, http_response => 1, max => 4,
)->append([qw[
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
]] => sub {
    my ($self) = @_;
    return  if $self->has_error
        or not $self->response->is_success
        or not $self->response->content_is_html;

    # Declare the scraper once and then reuse it
    state $scraper = scraper {
        process q(html title), title => q(text);
        process q(a), q(links[]) => q(@href);
    };

    # Employ amazing Perl (en|de)coding powers to handle HTML charsets
    my $doc = $scraper->scrape(
        $self->response->decoded_content,
        $self->final_url,
    );

    printf qq(%-64s %s\n), $self->final_url, $doc->{title};

    # Enqueue links from the parsed page
    $self->queue->prepend([
        grep {
            $_->can(q(host)) and $_->scheme =~ m{^https?$}x
            and $_->host eq $self->initial_url->host
            and (grep { length } $_->path_segments) <= 3
        } @{$doc->{links} // []}
    ] => __SUB__);
})->wait;
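The grep filter near the end is the whole crawl policy, so it may be worth looking at in isolation. Below is a minimal sketch that pulls the same conditions into a standalone function. The name keep_link and the sample URLs are hypothetical, and the snippet assumes the URI module is installed (the scraper hands back URI objects for @href, so it should already be there).

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: the same checks as the grep block in the crawler.
sub keep_link {
    my ($link, $start) = @_;
    return 0 unless $link->can('host');                 # skips mailto: and friends
    return 0 unless $link->scheme =~ m{^https?$};       # HTTP(S) only
    return 0 unless $link->host eq $start->host;        # stay on the same site
    my $depth = grep { length } $link->path_segments;   # count non-empty path parts
    return $depth <= 3 ? 1 : 0;                         # do not descend too deep
}

my $start = URI->new('http://sysd.org/page/1/');
for my $url (qw(
    http://sysd.org/page/2/
    http://example.com/
    mailto:nobody@example.com
    http://sysd.org/a/b/c/d/
)) {
    printf "%-28s %s\n", $url, keep_link(URI->new($url), $start) ? 'keep' : 'skip';
}
```

Only the first URL passes: the second is on another host, the third has no host at all, and the fourth is four path segments deep.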

Now what?!

The example above is half as long as the previous one. This comes at the cost of installing a bunch of external dependencies from CPAN:

$ cpanm AnyEvent::Net::Curl::Queued EV HTML::TreeBuilder::LibXML Web::Scraper utf8::all

Despite the use 5.016 pragma, this code works fine on Perl 5.10 if you get rid of the __SUB__ reference.
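One common way to drop __SUB__ is to store the callback in a lexical declared beforehand, so the closure can refer to itself by name. The sketch below shows the idiom on a toy in-memory "site"; the %site data and the $crawl variable are invented for the illustration and have nothing to do with the YADA API.

```perl
use strict;
use warnings;

# Hypothetical link graph: page => pages it links to.
my %site = (
    '/'  => ['/a', '/b'],
    '/a' => ['/b', '/c'],
    '/b' => [],
    '/c' => ['/'],
);

my (%seen, @visited);

my $crawl;            # declare first, so the sub below can see the variable
$crawl = sub {
    my ($page) = @_;
    return if $seen{$page}++;                     # visit every page once
    push @visited, $page;
    $crawl->($_) for @{ $site{$page} || [] };     # __SUB__->($_) on Perl 5.16+
};

$crawl->('/');
undef $crawl;         # break the closure's reference to itself

print "@visited\n";
```

The price is a reference cycle (the sub holds the variable that holds the sub), which is why the variable is cleared at the end. __SUB__ avoids that bookkeeping.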

So, which approach is better? Obviously, it depends. There is no silver bullet: web crawling should ultimately be I/O-bound, and specialized, well-tested libraries are what keep it that way. For instance, trimming the ::LibXML part from the use Web::Scraper::LibXML statement considerably slows down our tiny crawler, because the HTML parsing then consumes more CPU cycles than the connection polling.

As an edge case, let's see how the venerable GNU Wget tool (see also yada, which comes bundled with the AE::N::C::Queued distribution) behaves:

$ "time" wget -r --follow-tags a http://sysd.org/
0.23user 0.41system 1:10.20elapsed 0%CPU (0avgtext+0avgdata 23920maxresident)k
0inputs+40704outputs (0major+4323minor)pagefaults 0swaps

Despite its clear disadvantage of using a single connection, it is almost completely I/O-bound, since its URL extraction code doesn't require complete parsing of the HTML.
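To illustrate what link extraction without full parsing can look like, here is a rough Perl sketch that scans for href attributes with a regular expression and never builds a document tree. This is not Wget's actual algorithm, just the general idea; extract_hrefs and the sample markup are made up for the example, and a regex like this will miss or misread plenty of real-world HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: pull href values out of <a> tags without building a DOM.
sub extract_hrefs {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))}gi) {
        # Exactly one of the three groups matches: double-quoted,
        # single-quoted, or unquoted attribute value.
        push @links, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @links;
}

my $html = q{<a href="/page/2/">next</a> <A HREF='http://sysd.org/'>home</A> }
         . q{<a class=x href=/about>about</a> <p>no link here</p>};

print "$_\n" for extract_hrefs($html);
```

A single pass over the text like this costs very little CPU, which fits the near-zero %CPU figure reported above.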

About stas

Just another lazy, impatient and arrogant IT guy.