June 2010 Archives

Why you should scrape HTML with regexes

I used to always scrape little bits of HTML properly, but I rarely bother anymore. I wrote this little bit of code today because I didn't find something on CPAN to do it. That is, nothing that didn't require authentication like Net::Twitter:

when (m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    require LWP::Simple;
    my $user = $+{user};
    if (my $content = get($ARGV[0])) {
        my ($when) = $content =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
        my ($twat) = $content =~ m[<meta content="(?<tweet>.*?)" name="description" />];
        if ($when and $twat) {
            say "Twat by $user $when: $twat";
        }
    }
}

Given a URL, it prints the full contents of a tweet to IRC. Here's the full program.
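To try the same regex approach without touching the network, here is a minimal sketch that runs against a canned string. The HTML sample, the URL, and the scrape_tweet helper are all made up for illustration; the sample only imitates the two elements the regexes above look for, not Twitter's real markup.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical markup, invented for this sketch. It only imitates the two
# elements the regexes match against.
my $content = <<'HTML';
<html><head>
<meta content="Just setting up my example" name="description" />
</head><body>
<span class="published timestamp" data="{time:'0'}">about 2 hours ago</span>
</body></html>
HTML

# Hypothetical helper: returns (when, tweet), or an empty list if either
# piece is missing from the page.
sub scrape_tweet {
    my ($html) = @_;
    my ($when)  = $html =~ m[<span class="published timestamp"[^>]+>(.*?)</span>];
    my ($tweet) = $html =~ m[<meta content="(.*?)" name="description" />];
    return unless defined $when and defined $tweet;
    return ($when, $tweet);
}

my $url = 'http://twitter.com/example_user/status/12345';    # made-up URL
if ($url =~ m[//twitter\.com/(?<user>[^/]+)/status/(?<id>\d+)]) {
    my $user = $+{user};
    # A list assignment in boolean context is true only if the helper
    # returned something.
    if (my ($when, $tweet) = scrape_tweet($content)) {
        say "Tweet by $user $when: $tweet";
    }
}
```

When the other side changes its layout, scrape_tweet returns an empty list and nothing is printed, which is the same failure mode a class-based lookup in a proper parser would have.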

In a previous life I'd have used HTML::TreeBuilder, but after I actually wrote a bunch of simple scraping programs I found that I was wasting my time. Proper HTML parsing where you're extracting the value of a tag given a class breaks just as hard when the other side completely changes their HTML layout.

That's what they usually end up doing in my experience. I've rewritten the guts of POE::Component::IRC::Plugin::ImageMirror's _mirror_imgshack at least 5 times now, and a proper parser wouldn't have helped with any of those rewrites.

Of course I'd use a proper parser for something like getting the nth element of a table, I'm not crazy. I just can't be bothered for something simple like the above.

Don't be afraid to use power tools, but don't be afraid either to use the incorrect and stupid version whose API you don't have to figure out all over again when stuff breaks unexpectedly.

How I set up my Debian server to run perl 5.13.1 with perlbrew

In May I decided to stop using Debian's perl 5.10.1 in favor of a 5.13.1 built with perlbrew, with CPAN modules built with cpanminus. It's been great. Here's how I did it.

Before switching over I ignored Debian's perl library packages, and installed everything with cpanm into /usr/local. But since I wanted to use the new post-5.10 features of Perl I thought I might as well replace all of it and use a newer perl.

What I did:

  • Created a v-perlbrew user. All users on the server can use this centrally managed Perl and its modules
  • Added this to everyone's .bashrc: test -f ~/perl5/perlbrew/etc/bashrc && source ~/perl5/perlbrew/etc/bashrc
  • Made a list of CPAN modules that I need. When I upgrade the perlbrew perl I can just run grep -v ^# cpan-modules | cpanm to get all the required modules with the new perl.
  • Ran around changing PATH in crontabs, Apache settings etc. so that everything that isn't internal to Debian itself uses perlbrew's perl instead of /usr/bin/perl.

Getting the PATHs right everywhere turned out to be the hardest part. A lot of things in Debian have a path like /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin. I haven't found out how to change that; presumably some of them are hardcoded into executables like bash.

To get around that I have to hardcode the PATH to perlbrew in every crontab that uses perl. For a full list of these and other changes I've made the output of git --git-dir /etc/.git log -p --reverse -Sperlbrew available as a Gist.
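Since a wrong PATH fails silently (the job just runs under /usr/bin/perl), a quick check of which perl a script actually runs under can help. This is a minimal sketch, not something from my setup: the is_perlbrew_perl helper is hypothetical, and its pattern assumes perlbrew's default layout of perl5/perlbrew/perls/<name>/bin/perl.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the interpreter path looks like a
# perlbrew-managed perl. Assumes the default perl5/perlbrew/perls/<name>/bin
# layout; adjust the pattern if perlbrew's root lives elsewhere.
sub is_perlbrew_perl {
    my ($path) = @_;
    return $path =~ m{/perl5/perlbrew/perls/[^/]+/bin/perl[^/]*\z} ? 1 : 0;
}

# $^X is the path of the perl binary running this script, and $^V is its
# version, which the %vd format prints as e.g. 5.13.1.
printf "%s (v%vd) %s\n", $^X, $^V,
    is_perlbrew_perl($^X) ? "looks like a perlbrew perl"
                          : "is NOT a perlbrew perl";
```

Dropping this into a crontab entry or a CGI script temporarily shows whether that environment picked up the PATH change.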

The only caveat I've encountered is that there's one global perlbrew bashrc in ~/perl5/perlbrew/etc/bashrc, so I can't use perlbrew switch to switch only some users onto a given perl. It would be neat if perlbrew supported having the current symlinks in a local ~/perl5 while the actual binaries and modules were in ~v-perlbrew/perl5.

Check out Devel::NYTProf 4.00

Tim Bunce's Devel::NYTProf has a bunch of improvements in version 4.00, which was released yesterday.

The compatibility problem with Devel::Declare code like Method::Signatures::Simple that I previously blogged about has been fixed. It can now also profile inside string evals, and more.

Update: Tim Bunce now has a posting about NYTProf 4.00 on his blog.

Adding gettext support to Git

I have an RFC patch series that adds gettext localization support to Git, so that eventually you'll be able to configure Git to e.g. shout error messages at you in German. Won't that be a nice variant of the current abuse? I think so.

Here's the latest version of the patch series I posted to the Git mailing list.

For the Perl side of things (Git is partially implemented in Perl) I'm using libintl-perl's Locale::Messages. It was very pleasant to work with it. I wonder why more Perl projects don't use it instead of Perl-y libraries like Locale::Maketext.

Maybe it's just the GNU gettext dependency they're trying to get rid of, although libintl-perl includes a pure-Perl version of the tools it provides for reading .mo files, so probably not.
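For a feel of what using Locale::Messages looks like, here is a minimal sketch. It is not the code from the patch series: the 'example-app' text domain, the catalog directory, and the __ wrapper are placeholders, and the module is loaded optionally so the script still runs where libintl-perl isn't installed.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_MESSAGES);

# libintl-perl is not a core module, so load it optionally.
my $have_libintl = eval { require Locale::Messages; 1 };

if ($have_libintl) {
    setlocale(LC_MESSAGES, '');    # honor LANG / LC_* from the environment
    # Placeholder text domain and catalog directory for this sketch.
    Locale::Messages::textdomain('example-app');
    Locale::Messages::bindtextdomain('example-app', '/usr/share/locale');
}

# Hypothetical wrapper: translate if possible, else pass the message through.
sub __ {
    my ($msgid) = @_;
    return $have_libintl ? Locale::Messages::gettext($msgid) : $msgid;
}

print __("Hello, world!"), "\n";
```

With no catalog installed for the placeholder domain, gettext hands back the original string, so untranslated programs keep working in English.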

Actually the way most open source projects do localization is "not at all". I don't blame them, I certainly can't be bothered most of the time. But I wonder to what degree we're losing potential users & contributors because of this.

There was a recent-ish study of social networks on GitHub where it was evident that a lot of Japanese Perl users had formed a social-ghetto amongst themselves. I've seen a few trending Perl repositories that only have README files in Japanese.

Maybe better localization tools and, most of all, a commitment to use them would help to bridge some of that.

About Ævar Arnfjörð Bjarmason

Blogging about anything Perl-related I get up to.