Why you should scrape HTML with regexes

By Ævar Arnfjörð Bjarmason on June 17, 2010 10:44 PM under URI::Title, html, regex, scraping, twitter

I used to always scrape little bits of HTML properly, but I rarely bother anymore. I wrote this little bit of code today because I couldn't find anything on CPAN to do it, or at least nothing that doesn't require authentication the way Net::Twitter does:

when (m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    # Load LWP::Simple lazily, only when a Twitter URL shows up
    require LWP::Simple;
    LWP::Simple->import;
    my $user = $+{user};
    if (my $content = get($ARGV[0])) {
        # Timestamp text and tweet body, as they appear in the page source
        my ($when) = $content =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
        my ($twat) = $content =~ m[<meta content="(?<tweet>.*?)" name="description" />];
        if ($when and $twat) {
            say "Twat by $user $when: $twat";
        }
    }
}

Given a URL, it prints the full contents of a tweet to IRC; here's the full program.
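To try the two regexes in isolation, here is a minimal, self-contained sketch that runs them against an inline string instead of a live page. The `$sample_html` markup and the `extract_tweet` helper are made up for illustration; the markup only mimics the shape the regexes expect and is not Twitter's actual output.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical markup, shaped like what the regexes in the post expect.
my $sample_html = <<'HTML';
<html><head>
<meta content="Regexes are fine for small jobs" name="description" />
</head><body>
<span class="published timestamp" data="x">2 hours ago</span>
</body></html>
HTML

# Hypothetical helper: returns (when, text), or an empty list if either
# piece is missing, which is what happens after a layout change.
sub extract_tweet {
    my ($content) = @_;
    my ($when) = $content =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
    my ($text) = $content =~ m[<meta content="(.*?)" name="description" />];
    return unless defined $when and defined $text;
    return ($when, $text);
}

if (my ($when, $text) = extract_tweet($sample_html)) {
    say "Tweet $when: $text";
}
else {
    say "No match; the layout probably changed";
}
```

Returning an empty list on failure keeps the caller down to a single `if`, which is about all the error handling a throwaway scraper like this needs.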

In a previous life I'd have used HTML::TreeBuilder, but after I actually wrote a bunch of simple scraping programs I found that I was wasting my time. Proper HTML parsing where you're extracting the value of a tag given a class breaks just as hard when the other side completely changes their HTML layout.

That's what they usually end up doing in my experience. I've rewritten the guts of POE::Component::IRC::Plugin::ImageMirror's _mirror_imgshack at least 5 times now, and a proper parser wouldn't have saved me from any of those rewrites.

Of course I'd use a proper parser for something like getting the nth element of a table; I'm not crazy. I just can't be bothered for something simple like the above.
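For contrast, the table case is where a parser earns its keep. The following is only a sketch, assuming HTML::TreeBuilder is installed from CPAN; the table contents and the `nth_cell` helper are invented for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Made-up table for illustration.
my $html = <<'HTML';
<table>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
  <tr><td>gamma</td><td>3</td></tr>
</table>
HTML

# Hypothetical helper: text of the cell at 1-based ($row, $col),
# or undef when that row or column does not exist.
sub nth_cell {
    my ($content, $row, $col) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($content);
    my @rows  = $tree->look_down(_tag => 'tr');
    my $tr    = $rows[$row - 1];
    my @cells = $tr ? $tr->look_down(_tag => 'td') : ();
    my $cell  = $cells[$col - 1];
    my $text  = $cell ? $cell->as_text : undef;
    $tree->delete;    # release the parse tree explicitly
    return $text;
}

say nth_cell($html, 2, 1);
```

Counting rows and cells with a regex means handling nesting and attribute noise by hand; here the parser has already done that, so indexing into the lists is all that's left.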

Don't be afraid to use power tools, but don't be afraid either to use the incorrect and stupid version whose API you don't have to figure out all over again when stuff breaks unexpectedly.

Tagged as:

flamebait, HTML, regex, scraping, twitter, URI::Title

4 Comments

tempire | June 18, 2010 12:19 AM | Reply

Even regexes are too much work as far as I'm concerned.
The recent Mojo addition, Mojo::DOM, accepts CSS3 selectors:

Example with Mojo::Client:

print Mojo::Client->new->get('http://twitter.com/user')->success->dom->at('span.published');

Aristotle | June 18, 2010 1:46 AM | Reply

when (m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    my $user = $+{user};

    # Loaded at compile time so the scraper { ... } / process syntax parses
    use Web::Scraper;
    use URI;

    my $twat = ( scraper {
        process "span.published.timestamp",
            when => "TEXT";
        process q[//meta[@name="description"]],
            what => sub { $_->attr("content") };
    } )->scrape( URI->new( $ARGV[0] ) );

    # scrape() returns a hash reference
    if ($twat->{when} and $twat->{what}) {
        say "Twat by $user $twat->{when}: $twat->{what}";
    }
}

I know which of these I’d rather look at…

Flavio Poletti | June 18, 2010 1:33 PM | Reply

This reminds me of this and this on Perl Monks some time ago... I agree that when the info you're after can be detected by some local context, going for the global context can be weaker and more time-consuming.

Ævar Arnfjörð Bjarmason replied to comment from Aristotle | October 7, 2010 10:29 AM | Reply

You're right: once you add URI decoding into the mix, Web::Scraper is much easier. Actually, it's easier even if you don't do that.

I'll probably just use Web::Scraper instead of regexes now, even if I know the regexes work and I don't have to maintain them. Web::Scraper is a lot easier than LWP + HTML::TreeBuilder.
