Web Scraping with Modern Perl (Part 1)

tl;dr

Grab the gist with the complete, working source code.

I often hear the question: "so, you're a Perl guy; could you show me how to make a web crawler/spider/scraper, then?" I hope this post series will become my ultimate answer :)

First of all, I compiled a small list of features that people expect of crawlers nowadays:

  1. capability of concurrent, persistent connections;
  2. usage of CSS selectors to process HTML;
  3. easily modifiable source instead of a flexible OOP inheritance structure;
  4. FEWER DEPENDENCIES!

Well, sorry WWW::Mechanize, it's not your turn. Instead, the first example will be based on Mojo::UserAgent from the Mojolicious framework. Being event-driven and low on dependencies (none beyond Mojolicious itself) makes it especially attractive for Perl newcomers with some jQuery literacy.

Boilerplate

Let's call our project mojo-crawler.pl. Here's how it begins:

#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
    http://sysd.org/page/4/
    http://sysd.org/page/5/
    http://sysd.org/page/6/
);

# Limit parallel connections to 4
my $max_conn = 4;

# User agent following up to 5 redirects
# (detect_proxy was removed in later Mojolicious releases;
# there, use $ua->proxy->detect instead)
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;

# Keep track of active connections
my $active = 0;

Note that I'm using my very own server as a guinea pig. Consider that my pledge that the parallelization method I chose is safe.

Event loop

Here we keep a constant-sized pool of active connections, populating it with URLs from our FIFO queue. The anonymous sub {} fires every time the event loop is idle (0-second timer):

Mojo::IOLoop->recurring(
    0 => sub {
        for ($active + 1 .. $max_conn) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

Now, start the event loop unless it is already running somewhere else. In this code, it won't be started anywhere else. But who knows where copy-and-paste will bury it in the future?!

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

Download callback

Every download ends up here, even the failed ones. Either way, we decrease the $active counter to free a connection slot:

sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;

    # Parse only OK HTML responses
    # (later Mojolicious releases dropped is_status_class;
    # $tx->res->is_success is the modern equivalent)
    return
        if not $tx->res->is_status_class(200)
        or $tx->res->headers->content_type !~ m{^text/html\b}ix;

    # Request URL
    my $url = $tx->req->url;

    say $url;
    parse_html($url, $tx);

    return;
}

# Not implemented yet!
sub parse_html { return }

Fear not: parse_html() is implemented right below!

Take one

Let's make sure this code actually does what it should, while it is still low on the line count:

$ perl mojo-crawler.pl 
http://sysd.org/page/2/
http://sysd.org/page/4/
http://sysd.org/page/3/
http://sysd.org/page/5/
http://sysd.org/
http://sysd.org/page/6/
$

Fine: the downloads completed in the order each resource finished downloading, not the order they were queued. Oh, and http://sysd.org/page/1 simply redirects to http://sysd.org/.

Adding some depth

The most difficult part of making web crawlers isn't making them start; it's making them stop. Our complete parse_html() also takes care of feeding the URL queue with the URLs extracted from <a href="..."> links. Plus, it performs a few trivial checks on:

  1. Protocols (only HTTP and HTTPS);
  2. Path depth (don't go deeper than /a/b/c);
  3. URL revisiting (don't download the same resource over and over);
  4. Cross-domain links (not allowed).
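Mojo::URL makes each of these checks a one-liner. Here is a quick standalone illustration (it assumes Mojolicious is installed; the sample URLs are made up, and the to_abs($base) argument form matches this post's Mojolicious API, while newer releases prefer $url->base($base)->to_abs):

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings qw(all);

use Mojo::URL;

# Resolve a relative href against the page URL and drop the
# fragment, just like the crawler does
my $base = Mojo::URL->new('http://sysd.org/page/1/');
my $link = Mojo::URL->new('/a/b/c/d#comments')
    ->to_abs($base)
    ->fragment(undef);

say $link->protocol;               # "http": check 1 passes
say scalar @{$link->path->parts};  # 4, deeper than /a/b/c: check 2 fails
say $link->host eq $base->host
    ? 'same host'
    : 'cross-domain';              # same host: check 4 passes
```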

And, to show we've been there, let's print the title of the page:

sub parse_html {
    my ($url, $tx) = @_;

    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless grep { $link->protocol eq $_ } qw(http https);

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once
        state $uniq = {};
        ++$uniq->{$url->to_string};
        next if ++$uniq->{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }
    say '';

    return;
}

Take two

This time, it will be a lot slower, as every internal link is followed and downloaded. The crawler will print the accessed URL, the title of the page, and the extracted non-visited links:

$ "time" perl mojo-crawler.pl 
http://sysd.org/
sysd.org
 -> http://sysd.org/tag/benchmark/
 -> http://sysd.org/tag/command-line-interface/
 -> http://sysd.org/tag/console/
 -> http://sysd.org/tag/overhead/
 -> http://sysd.org/tag/terminal/
 -> http://sysd.org/tag/teste/
 -> http://sysd.org/tag/tty/
 -> http://sysd.org/tag/velocidade/
 -> http://sysd.org/tag/browser/
 -> http://sysd.org/tag/deprecation/
 -> http://sysd.org/tag/ie/
 -> http://sysd.org/tag/microsoft/
 -> http://sysd.org/tag/navegador/
 -> http://sysd.org/tag/webdesign/
 -> http://sysd.org/tag/webdev/
 -> http://sysd.org/tag/api/
 -> http://sysd.org/tag/hack-2/
 -> http://sysd.org/tag/integration/
 -> http://sysd.org/tag/rest/
...
27.73user 0.88system 3:48.46elapsed 12%CPU (0avgtext+0avgdata 98272maxresident)k
0inputs+8outputs (0major+6749minor)pagefaults 0swaps
$

A very important final note: although this tiny crawler operates through the recursive traversal of links, it is implemented in an iterative way, so it is very light on memory consumption. In fact, the only structure that hogs RAM is the $uniq hashref; tie it to any kind of persistent storage if that concerns you. The FIFO queue @urls could also grow a lot if the crawled site has dynamically generated link lists (or even broken pagination), so keeping it out of some kind of key/value database is a bit reckless.
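As a sketch of that tie idea (the seen-urls file name is my invention; I'm using SDBM_File because it ships with core Perl, but DB_File or any key/value store works the same way):

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings qw(all);

use Fcntl qw(O_CREAT O_RDWR);
use SDBM_File;

# Persist the visited-URL set on disk so a restarted crawler
# doesn't re-download everything
tie my %uniq, 'SDBM_File', 'seen-urls', O_RDWR | O_CREAT, 0666
    or die "Can't tie seen-urls: $!";

# Drop-in replacement for the in-memory check inside parse_html():
# next if $uniq{$link->to_string}++;
$uniq{'http://sysd.org/'}++;

untie %uniq;
```

Mind that SDBM limits each key/value pair to roughly a kilobyte; URLs usually fit, but for anything bigger, reach for DB_File or a proper database.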

Conclusion

Despite this being a toy spider, I believe it is good enough to solve 80% of web crawling/scraping problems. The remaining 20% would require much more code, tests and infrastructure (A.K.A. the Pareto principle). Please don't reinvent the wheel: check out the CommonCrawl project first! And keep checking my Perl blog for more on that 80% of web crawling ;)

Continued

Web Scraping with Modern Perl (Part 2 - Speed Edition)
