Why you should scrape HTML with regexes

I used to always scrape little bits of HTML properly, but I rarely bother anymore. I wrote this little bit of code today because I couldn't find anything on CPAN to do it, or at least nothing that didn't require authentication the way Net::Twitter does:

when (m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    # Load LWP::Simple lazily; require doesn't import get(), so call it fully qualified
    require LWP::Simple;
    my $user = $+{user};
    if (my $content = LWP::Simple::get($ARGV[0])) {
        # Timestamp text and tweet text, straight out of the raw HTML
        my ($when) = $content =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
        my ($twat) = $content =~ m[<meta content="(?<tweet>.*?)" name="description" />];
        if ($when and $twat) {
            say "Twat by $user $when: $twat";
        }
    }
}

Given a URL, it prints the full contents of a tweet to IRC. Here's the full program.
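To try those patterns without touching the network, here is a minimal sketch that runs them against a made-up HTML fragment. The fragment, the example URL, and the helper names parse_status_url and scrape_tweet are assumptions for illustration; they are not part of the original program, and the markup only imitates what the regexes expect.

```perl
use strict;
use warnings;
use feature 'say';

# Made-up fragment shaped like the markup the regexes look for (an assumption,
# not real Twitter output).
my $content = <<'HTML';
<html><head>
<meta content="Just setting up my scraper" name="description" />
</head><body>
<span class="published timestamp" data="x">about 2 hours ago</span>
</body></html>
HTML

# Hypothetical helper: pull the user name and status id out of a status URL.
sub parse_status_url {
    my ($url) = @_;
    return unless $url =~ m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)];
    return ($+{user}, $+{id});
}

# Hypothetical helper: extract the timestamp text and the tweet text.
sub scrape_tweet {
    my ($html) = @_;
    my ($when)  = $html =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
    my ($tweet) = $html =~ m[<meta content="(.*?)" name="description" />];
    return ($when, $tweet);
}

my ($user, $id)    = parse_status_url('http://twitter.com/example_user/status/12345');
my ($when, $tweet) = scrape_tweet($content);
say "Tweet $id by $user $when: $tweet" if $when and $tweet;
```

Note that the span pattern insists on at least one character between the class attribute and the closing angle bracket, which is why the sample span carries an extra attribute. That is exactly the kind of detail that breaks when the site changes its markup.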

In a previous life I'd have used HTML::TreeBuilder, but after actually writing a bunch of simple scraping programs I found that I was wasting my time. Proper HTML parsing, where you extract the value of a tag given a class, breaks just as hard as a regex does when the other side completely changes their HTML layout.

That's what they usually end up doing in my experience. I've rewritten the guts of POE::Component::IRC::Plugin::ImageMirror's _mirror_imgshack at least 5 times now, and a proper parser wouldn't have helped with any of those rewrites.

Of course I'd use a proper parser for something like getting the nth element of a table; I'm not crazy. I just can't be bothered for something simple like the above.
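For contrast, here is a rough sketch of the table case with HTML::TreeBuilder, where position matters more than local context. The table markup and the nth_cell helper are made up for illustration, and the code assumes the HTML-Tree distribution is installed.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Made-up table; any real page would look different.
my $html = <<'HTML';
<table>
  <tr><td>alpha</td><td>beta</td><td>gamma</td></tr>
  <tr><td>one</td><td>two</td><td>three</td></tr>
</table>
HTML

# Hypothetical helper: return the text of the cell at a zero-based row and
# column, or undef if there is no such cell.
sub nth_cell {
    my ($source, $row, $col) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($source);
    my @rows = $tree->look_down(_tag => 'tr');
    my $text;
    if (my $tr = $rows[$row]) {
        my @cells = $tr->look_down(_tag => 'td');
        $text = $cells[$col]->as_text if $cells[$col];
    }
    $tree->delete;    # free the parse tree explicitly
    return $text;
}

say nth_cell($html, 1, 2);
```

Counting rows and cells like this with a regex gets ugly fast, which is where the parser earns its keep. The sketch ignores nested tables, where look_down would also pick up rows from the inner table.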

Don't be afraid to use powertools, but don't be afraid to use the incorrect and stupid version whose API you don't have to figure out all over again when stuff breaks unexpectedly either.


Even regexes are too much work as far as I'm concerned.
The recent Mojo addition, Mojo::DOM, accepts CSS3 selectors:

Example with Mojo::Client:

print Mojo::Client->new->get('http://twitter.com/user')->success->dom->at('span.published');

when (m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    my $user = $+{user};
    # Loaded at compile time so the scraper { ... } block syntax parses
    use Web::Scraper;
    use URI;
    my $twat = ( scraper {
        process "span.published.timestamp",
            when => "TEXT";
        process q[//meta[@name="description"]],
            what => sub { $_->attr("content") };
    } )->scrape( URI->new( $ARGV[0] ) );
    # scrape() returns a hash reference
    if ($twat->{when} and $twat->{what}) {
        say "Twat by $user $twat->{when}: $twat->{what}";
    }
}

I know which of these I’d rather look at…

This reminds me of this and this on Perl Monks some time ago... I agree that when the info you're after can be detected from some local context, going for the global context can be weaker and more time-consuming.

About Ævar Arnfjörð Bjarmason

Blogging about anything Perl-related I get up to.