Don't be afraid to use Perl

A long time ago, I wrote this regex to handle a Perl problem I had:

$html =~ s{
             (<a\s(?:[^>](?!href))*href\s*)
             (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
             ([^>]+>)
          }
          {$1 . decode_entities($2) .  $4}gsexi;

It didn't work. That not only taught me how to use a proper parser, it led me to write HTML::TokeParser::Simple.
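For comparison, here is a minimal sketch of pulling hrefs out with HTML::TokeParser::Simple instead. It assumes the module is installed from CPAN; the helper name `hrefs_via_parser` and the sample markup are made up for illustration, and this is not the code that replaced the regex above.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;    # CPAN module, not part of core Perl

# Hypothetical helper: return the href of every <a> start tag in $html.
sub hrefs_via_parser {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    my @hrefs;
    while ( my $token = $parser->get_token ) {
        # Skip text, comments, end tags and every non-anchor start tag.
        next unless $token->is_start_tag('a');
        my $href = $token->get_attr('href');
        push @hrefs, $href if defined $href;
    }
    return @hrefs;
}

# Made-up sample: mixed case, an unquoted value, and an anchor with no href.
my $sample = '<p><a class="x" href="/one">1</a> <A HREF=/two>2</A> <a name="n">3</a></p>';
print "$_\n" for hrefs_via_parser($sample);
```

The parser takes care of attribute order, quoting style and tag case, which are the details the regex above was trying to handle by hand.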

However, some people still insist upon using regular expressions for parsing HTML.

How do you know when you should use a module instead? When the risks outweigh the rewards. Otherwise, go ahead and use that regex. We're here to get jobs done and we should know when is the right time for dogma and when is the right time to extract that href. Don't be afraid to use Perl.
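As a rough illustration of the low-risk end of that trade-off, a throwaway href extraction might look like the sketch below. The helper name `hrefs_via_regex` and the sample markup are hypothetical. The pattern assumes quoted attribute values and no `>` inside a tag, and it would also match something like `data-href`, which is exactly the sort of risk you have to weigh.

```perl
use strict;
use warnings;

# Hypothetical helper: quick-and-dirty href extraction for one-off scripts.
# Assumes the value is quoted and that no '>' appears before the href.
sub hrefs_via_regex {
    my ($html) = @_;
    my @hrefs;
    while ( $html =~ m{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1}gis ) {
        push @hrefs, $2;    # $1 is the quote character, $2 the value
    }
    return @hrefs;
}

# Made-up sample input using both quoting styles.
my $sample = '<a class="x" href="/one">1</a> <a href=\'/two\'>2</a>';
print "$_\n" for hrefs_via_regex($sample);
```

For a one-off scrape of markup you control, this is probably fine. Once the input comes from somewhere you don't control, the assumptions in the comments stop holding.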

(And when it's a full-blown production system and it's critical that the information is correct, I'll be the first to yell at you for the regex.)

5 Comments

Every time I see someone use a regex to parse HTML, I have to think of this.

The problem is that people who don't know to use a module are the people who don't know how to weigh the risks and rewards. That is, many of the questions I see on Stack Overflow dealing with HTML and regexes indicate that the person doing the job isn't a very good programmer and hasn't had to munge a lot of HTML. There's no basis for them to judge anything. Indeed, they don't even value the advice, such as yours, that most people give.

There are two dimensions to this problem: the person's ability with the tools and the person's understanding of the problem. I tend to think that people who insist on using regex for HTML (or for most of their work) have a problem with both.

Being deficient in either of those areas isn't a big deal. It's very easy to solve both of them if you know that you have either problem. The third, and worst, problem is thinking that you know your tools and you know your problem. The best programmers I know realize that they could know a lot more about their tools and that they don't fully understand the problem. That's the trick that makes them think about what they are doing, consider trade-offs, and so on.

I've been doing Perl for a long time, and I'm still learning how to use it properly. :)

I blame the school system. Everyone is taught, from kindergarten onward, to rely only on their own efforts. Copying is cheating. Using someone else's work is wrong. The result is that when they get into the work force, they treat every problem as though it were a school assignment. They go off and try to solve it themselves. Not that you can blame them; it's all they know. But it is disheartening to discover that a junior programmer has spent the last week trying to solve a problem whose solution is already in your code base. Sigh. If any of you are teaching Perl to young minds, please teach them the importance of using resources like CPAN, and that they don't have to solve every problem on their own. Using other people's code is both faster and more effective.

When I have to do some simple parsing of HTML I usually just use regexes if the parsing doesn't involve a tree structure (like getting the nth thing in a table).

Why? Because even though m[<input name="foo" value="(.*?)" />] isn't pretty, I'm going to have to change the code anyway when whoever's emitting the HTML changes their program.

As an example I had some code that posted to imageshack and used a proper HTML parser. I had to change that anyway each time they changed the structure of their HTML.

Or... just teach Python instead of Perl =)

I think most people just want to get the job done as fast as possible. Most times when I use those CPAN modules, it's a hassle to read the documentation and learn how to use them. They're a better solution in the long run, but we humans like to get things done fast.

About Ovid

Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant. Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.