Q: When not to use Regexp? A: HTML parsing

It always starts out as something simple and innocent and then the Internet ruins it.

I am giving a data mining talk at Ohio LinuxFest 2012 and, surprise, surprise, there is going to be a nice helping of Perl. So I have been on the Internet doing research, looking for some simple scrapers and collectors to mention in my talk. I always prefer to give multiple examples for any problem, since programming does not have a one-size-fits-all model. To make a long story short, I found a bunch of different social media scrapers, and the problem with most of them was the same. Things like this Ruby example:
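
(A representative sketch of the shape, not the original snippet; the shape is what matters.)

    # Illustrative only: scraping links out of raw HTML with a regex.
    html = '<a href="http://example.com/">example</a>'
    html.scan(%r{<a\s+href="([^"]+)"[^>]*>(.*?)</a>}i) do |href, text|
      puts "#{text} -> #{href}"
    end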

Another Ruby example followed the same pattern, comment carried over from its original source and all.

The Python ones I found were a little more deceptive. Here is what I found on the surface:
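
(A representative sketch rather than the original code; the reassuring part was the import list.)

    # Illustrative sketch, not the found code: the imports look promising.
    import re
    from bs4 import BeautifulSoup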

So I see BeautifulSoup included and I am thinking that must be like HTML::Parser, right? Wrong. Instead I find these:
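
(Again illustrative rather than the original: the parser gets imported and then ignored.)

    # Illustrative only: the parser was imported above, but the actual
    # extraction is still a regex run over the raw markup.
    import re

    html = '<a href="http://example.com/">example</a>'
    for href in re.findall(r'<a\s+href="([^"]+)"', html):
        print(href)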

What about HTMLParser? More of the same:
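
(Same caveat: a sketch of the pattern, not the found code.)

    # Illustrative only: HTMLParser is imported but never subclassed;
    # a regex does the work instead.
    import re
    from html.parser import HTMLParser

    html = '<a href="http://example.com/">example</a>'
    print(re.findall(r'href="([^"]+)"', html))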

The one that probably wasted the most brain cells was this Perl one:
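
(An illustrative sketch of the approach, not the original expression.)

    # Illustrative only: a hand-rolled link "parser" built from one
    # sprawling regex. Nested tags, comments, and CDATA all break it.
    use strict;
    use warnings;

    my $html = '<a href="http://example.com/" title="x">example</a>';
    while (
        $html =~ m{
            <a \s+                      # opening anchor tag
            (?: [^>]*? \s )?            # any attributes before href
            href \s* = \s*              # the href attribute
            ["']? ( [^"'>\s]+ ) ["']?   # its value, maybe quoted
            [^>]* >                     # the rest of the tag
            ( .*? )                     # link text (breaks on nested tags)
            </a>
        }gxmsi
        )
    {
        print "$2 -> $1\n";
    }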

Wow, just wow. Untold hours of work went into building the above expressions, in some cases with the author knowing they would break no matter what.

HTML parsing with regexps is the Cthulhu way, and yet people still do it even though good parsers exist and have already solved this problem. HTML 4.01 became a W3C Recommendation in December 1999 and the first HTML5 working draft appeared in 2008. HTML::Parser released version 2.14 in 1998, and HTML::TreeBuilder released version 0.50 in 1998 as well.
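
For contrast, here is a minimal sketch of the parser-based way with HTML::TreeBuilder (the input file name is made up):

    use strict;
    use warnings;
    use feature 'say';
    use HTML::TreeBuilder;

    # Let the parser worry about quoting styles, attribute order,
    # comments, and the rest of HTML's warts.
    my $tree = HTML::TreeBuilder->new_from_file('page.html');    # hypothetical input

    for my $a ( $tree->look_down( _tag => 'a' ) ) {
        say $a->attr('href') if defined $a->attr('href');
    }

    $tree->delete;    # HTML::Element trees need explicit cleanup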

I shouldn't be surprised, considering the empowerment I feel when using regular expressions to solve problems. Then we see things like `perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' [number]`, which can determine whether a number is prime: it writes the number in unary as a string of 1s, and the pattern matches only if that string is empty, a single 1, or can be split into two or more equal groups of two or more 1s – in other words, only if the number is not prime. Neil already wrote up a walkthrough of the pattern, if you're interested.

Concurrently, I am also thinking, "I am not the only person who has had this problem. I bet someone already solved it on the Internet." A little searching and I had found half a dozen examples of using actual parsing libraries to work with HTML. Let me repeat the important part: I am not the only person who has had this problem. This is one of those ideas that should be impressed upon programmers regularly. Tunnel vision happens while you are focused on a project, and it only makes the final solution worse, not better. Another version of this is the 'Not Invented Here' syndrome that seems to invade programmers' minds and convince them not only that they can do the task better, but that it will be quicker to rebuild it from scratch. If this happens to you, take a step back and honestly assess the amount of time needed to write something new from scratch.

With all that said, I would remind people that htmlparsing.com is an expanding community resource that can help people understand how easy it is to use a parser instead of wasting brain time on a regexp solution. Consider this example:

    use strict;
    use warnings;
    use feature 'say';
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://news.ycombinator.com/');

    foreach my $link ( $mech->links() ) {
        if ( $link->url() =~ m/perl[.]org/xms ) {
            say $link->url();
        }
    }

That is all it takes to get started. I would also like to note that PHP and Python both have a good parsing library in html5lib.

13 Comments

You can use regexps to write a parser. The point is that you have to write a parser rather than just do a pattern match. The difference is that a parser interprets every single character from the start of the string – so when it encounters e.g. a literal double quote, it knows whether that double quote delimits an attribute value, or is part of the value of a single-quoted attribute, or is just part of the text content of a tag, or whatever.

A pattern match that may start matching at any arbitrary position within the string lacks that context – it is groping blindly. This is what you must avoid.

A pattern match that is only allowed to match-or-fail at a particular position within the string, though? That can well be used to write a parser, even if it isn't one by itself. If not, there will be a loop around it, keeping track of where that pattern, as well as a host of other patterns, should be tested. Or else it may be something like Regexp::Grammars, where the pattern is complex enough to be able to consume the entire string from the start – or fail to consume any of the string, from the start.
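
(A minimal sketch of that anchored style, assuming a deliberately toy grammar: every pattern is tied to the current position with \G, and /gc leaves pos() alone on failure so the next alternative is tried at the same spot.)

    use strict;
    use warnings;
    use feature 'say';

    # Toy lexer, not a real HTML one: match-or-fail at pos(), never grope.
    my $html = '<p class="x">hi</p>';

    while ( ( pos($html) // 0 ) < length $html ) {
        if ( $html =~ m{\G<([a-zA-Z]+)[^>]*>}gc ) {
            say "open tag:  $1";
        }
        elsif ( $html =~ m{\G</([a-zA-Z]+)>}gc ) {
            say "close tag: $1";
        }
        elsif ( $html =~ m{\G([^<]+)}gc ) {
            say "text:      $1";
        }
        else {
            die 'lexer stuck at position ', pos($html) // 0;
        }
    }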

But just because you are using patterns instead of BNF syntax or a manual lexer doesn’t mean you aren’t parsing. BeautifulSoup and HTMLParser most certainly qualify.

Thanks for the shoutout for htmlparsing.com. I based it on bobby-tables.com and would like to get more info dumped into it.

If anyone is interested in adding example code to htmlparsing.com to show people the right way to do parsing, without having to resort to reinventing the wheel and/or regexes, I'd love to have help. Here's the GitHub repository for the site.

Well, I have had to use regexes, without any sensible alternative, when we needed byte-level positioning information for the content. We were using early KinoSearch, which used byte-level positions at the time, and needed to tokenize HTML contents. We were skipping the tags in most cases, but retaining some of them for context and linking. Also, all the HTML had been generated by Perl modules under our control, so we could constrain their output to avoid funny entities and so on.

This was probably the worst piece of Perl I ever had to write -- although not content-sensitive like the examples above. Given the same circumstances, I'd still have to use regexes. Basically it is a consequence of the fact that we needed to do precise roundtrip content handling, with line/character positioning -- and byte-level positioning too -- and parsers almost invariably strip too much lexical info. I suppose, technically, we were lexing the HTML (with a shallow parse in some places) and that is probably closer to being acceptable.

I've had good luck with Mojo::DOM, which handles both HTML and XML in a relaxed manner. Here is a tiny example.
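
(A sketch of typical Mojo::DOM usage; the markup here is made up.)

    use strict;
    use warnings;
    use feature 'say';
    use Mojo::DOM;

    # Parse relaxed, tag-soup-tolerant HTML and query it with CSS selectors.
    my $dom = Mojo::DOM->new('<div><a href="http://example.com/">example</a></div>');

    say $_->attr('href') for $dom->find('a[href]')->each;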

FYI, HTML-HTML5-Parser passes the majority of the html5lib test suite.

Why bother to parse the HTML cleanly if a change to the site will break your code anyway? Usually you do not care about the structure of the whole HTML page but only the part you are interested in. The simplest way to identify that part is to match it with regexps, not by walking down a syntax tree.

Once you have that small part, there is a case for saying you should parse it as an HTML fragment and walk over the results. But usually you do not want that level of detail or flexibility. A regexp is often easy to write because you start with an example of what you are matching and then selectively replace certain parts with (.+) capturing groups; refine it a bit more as necessary until it works for all the example input you have. You know nothing about what the format will be when you fetch the page tomorrow, so you can't worry about it now. (It will usually be exactly the same, or else quite different, requiring a rewrite.)
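
(A sketch of that workflow, with made-up markup: the literal example becomes the pattern, and the interesting part becomes a capture group.)

    use strict;
    use warnings;

    my $page = '<td class="price">$19.99</td>';

    # Started life as the literal string: <td class="price">$19.99</td>
    if ( $page =~ m{<td \s+ class="price"> \$ (.+?) </td>}x ) {
        print "price: $1\n";
    }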

Usually you do not care about the structure of the whole HTML page but only the part you are interested in. The simplest way to identify that part is to match it with regexps, not by walking down a syntax tree.

I don’t know what you are trying to do, but the last time I tried, I found it far easier to locate a tag on an HTML page with a CSS selector than to laboriously match out the text with a pattern (not to mention having to manually handle charsets, unescaping, etc. in that case). Even XPath is still a lot easier.

Are you using a DOM and manually walking it with nested loops or something? I can’t imagine how else you’d find it even harder than using regexps.

Yes, I have sometimes found it easier to use HTML::TableExtract or other modules which truly parse the HTML and walk over its structure. It depends on how well-structured the page is and what you are trying to extract. A CSS selector is a nice declarative approach, and as long as you are matching based on particular attributes rather than 'the third row in the fourth table', you will be fairly robust against minor layout changes.

That said, a regexp approach can be declarative too. Because you can start by taking an example of the HTML you want to match and adding placeholders, it is immediately clear what HTML structure is matched - as long as you keep your regexp nicely formatted and commented using /x. I would paste in some example code I use to scrape Outlook Web Access, but this comment form doesn't allow plain text.

The ideal would be a kind of template language where you write pseudo-HTML with placeholders, this is then parsed into a structured query and that is matched against the parsed HTML document. There are various XML query languages, but I don't think any of them will cope well with the tag soup found on typical web pages.
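
(For what it's worth, CPAN's Template::Extract attempts roughly this: you write a Template Toolkit style template with placeholders and run it backwards over the document. A sketch of that idea, with made-up markup:)

    use strict;
    use warnings;
    use Template::Extract;

    my $extractor = Template::Extract->new;

    # The template is pseudo-HTML with placeholders; extract() runs it
    # "in reverse" against a real document and fills in the variables.
    my $template = '<li><a href="[% url %]">[% title %]</a></li>';
    my $document = '<li><a href="http://example.com/">example</a></li>';

    my $data = $extractor->extract( $template, $document );
    print "$data->{title}: $data->{url}\n" if $data;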

Yes, that is what the link to “the Cthulhu way” points to.

Because you can start by taking an example of the HTML you want to match and adding placeholders, it is immediately clear what HTML structure is matched - as long as you keep your regexp nicely formatted and commented using /x.

No amount of /x will make regexps look less messy than a CSS selector. Aside from that, the regexp will break not only when the layout breaks, but for even the most minor variation, such as the quotes around an attribute changing from double to single, or being removed altogether; the order of attributes changing; a comment being inserted somewhere; etc. Some of these contingencies you can defend against, but at the price of uglying up the regexp badly.

Also, as I said, what you get from a regexp match is a fragment of HTML – even if it does not contain tags. At the least, you have to deal with encodings and entities yourself.

If you know the markup is not going to change in any such way and you are dealing with a highly restricted set of values – something like a script that scrapes your online status out of your WLAN router’s web interface or some such – then sure, sometimes regexes are easier.

But they really don’t scale very far.

Try something like Web::Scraper or the screenscraping stuff in Mojo sometime if you haven’t. Believe me, you’re making yourself a lot of unnecessary work.
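
(A sketch of the Web::Scraper flavor of this, reusing the page from the post's example:)

    use strict;
    use warnings;
    use feature 'say';
    use URI;
    use Web::Scraper;

    # Declare what to extract: every href from every anchor on the page.
    my $links = scraper {
        process 'a', 'urls[]' => '@href';
    };

    my $result = $links->scrape( URI->new('http://news.ycombinator.com/') );
    say $_ for @{ $result->{urls} || [] };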

About Kimmel

I like writing Perl code and since most of it is open source I might as well talk about it too. @KirkKimmel on twitter