Why you should scrape HTML with regexes

By Ævar Arnfjörð Bjarmason on June 17, 2010 10:44 PM under URI::Title, html, regex, scraping, twitter

I used to always scrape little bits of HTML properly, but I rarely bother anymore. I wrote this little bit of code today because I couldn't find anything on CPAN to do it, or at least nothing that works without authentication, which Net::Twitter requires:

when (m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    # Load LWP::Simple lazily, only when a Twitter status URL shows up
    require LWP::Simple;
    LWP::Simple->import;
    my $user = $+{user};
    if (my $content = get($ARGV[0])) {
        # Timestamp text and tweet body, pulled straight out of the markup
        my ($when) = $content =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
        my ($twat) = $content =~ m[<meta content="(?<tweet>.*?)" name="description" />];
        if ($when and $twat) {
            say "Twat by $user $when: $twat";
        }
    }
}

Given a URL, it prints the full contents of a tweet to IRC; here's the full program.
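If you want to poke at the two regexes without hitting the network, here's a minimal sketch that runs them against a made-up HTML fragment. The markup, the sample values and the extract_tweet helper are assumptions for illustration only; the real status page may look different, which is rather the point.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical stand-in for a fetched status page; the real markup may differ.
my $html = <<'HTML';
<html><head>
<meta content="Regexes are fine for this" name="description" />
</head><body>
<span class="published timestamp" data="x">about 2 hours ago</span>
</body></html>
HTML

# extract_tweet is an illustrative helper, not part of the original program.
# Returns (timestamp, tweet text), or an empty list if either regex fails.
sub extract_tweet {
    my ($content) = @_;
    my ($when)  = $content =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
    my ($tweet) = $content =~ m[<meta content="(.*?)" name="description" />];
    return unless defined $when and defined $tweet;
    return ($when, $tweet);
}

if (my ($when, $tweet) = extract_tweet($html)) {
    say "Tweet $when: $tweet";
}
else {
    # Both matches have to succeed; anything else means the layout moved.
    say "Layout changed, time to fix the regex";
}
```

When the layout changes, both matches simply fail and the helper returns an empty list, so the only thing to fix is the regex itself.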

In a previous life I'd have used HTML::TreeBuilder, but after actually writing a bunch of simple scraping programs I found that I was wasting my time. Proper HTML parsing, where you extract the value of a tag given its class, breaks just as hard as a regex when the other side completely changes their HTML layout.

In my experience that's what they usually end up doing. I've rewritten the guts of POE::Component::IRC::Plugin::ImageMirror's _mirror_imgshack at least 5 times now, and a proper parser wouldn't have saved me from any of those rewrites.

Of course I'd use a proper parser for something like getting the nth element of a table; I'm not crazy. I just can't be bothered for something as simple as the above.

Don't be afraid to use power tools, but don't be afraid either to use the incorrect and stupid version, whose API you don't have to figure out all over again when stuff breaks unexpectedly.