How to Parse HTML
This is the first of a series of posts that will detail a Marpa-based "Ruby Slippers" approach to parsing liberal and defective HTML. As an example, let's look at a few lines taken more or less at random from the middle of the perl.org landing page. That page is exactly 400 lines long. Here is line 200 and some lines to either side of it.
</td>
<td>
<div class="module">
<a href="http://www.perlfoundation.org/">
<img alt=""
src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
width="45" height="45" />
</a>
<h4>
<a href="http://www.perlfoundation.org/">The Perl Foundation</a>
</h4>
<p>
The Perl Foundation is dedicated to the advancement
of the Perl programming language through open discussion,
collaboration, design, and code.
For readability, I've rearranged the whitespace, but otherwise the above is untouched. My more-or-less-random extract is part of a table, and captures the end tag of one cell and the beginning of another.
Marpa::HTML has no trouble fixing this up, but neither does the rendering engine in Firefox. If you cut and paste the above display into a file, and look at it in your favorite web browser, the result will probably be quite reasonable. So why am I saying that Marpa-based HTML parsing is a step forward?
What's the Big Deal?
Marpa::HTML as it sits is a flexible and usable tool, but what makes it different is best appreciated if you think in terms of writing, maintaining or forking Marpa::HTML. The rendering engine in your favorite browser is a monolith of mission-specific code, and in most cases maintaining it is the job of a team. Marpa::HTML was written in a couple of geek-weeks. It is reasonable to expect that even if a change amounted nearly to a total rewrite, it could be done in a similar amount of time.
Why is Marpa::HTML that much easier to code? Well, admittedly, Marpa::HTML does not do rendering, which makes life easier. Marpa::HTML just parses HTML, without trying to figure out how to arrange it on a display.
More to our purpose, Marpa::HTML divides HTML parsing into two layers: an HTML specific layer; and a general parsing layer. Most of the complexity goes into the general parsing layer, which is in carefully optimized C, and which contains no HTML-specific code. The HTML-specific layer is small and coded in Perl.
When I started writing Marpa::HTML, my general parsing layer was already written, tested and ready to go. I only needed to write an HTML-specific layer. Marpa::HTML was my first serious exercise of the Ruby Slippers, and my surprise at how easy it was to use inspired the name.
The advantages to breaking up an HTML parser into an HTML-specific layer and a parse engine can be compared to the advantages that were accrued by breaking up the original monolithic web browsers into a user interface and a rendering engine. The interface of modern browsers can be changed without hacking the rendering engine. The Marpa-powered approach to parsing HTML allows the programmer to completely change his approach to HTML without touching the parse engine.
As you read this series of posts, I hope the following will be food for the imagination:
- Marpa::HTML could be the basis of a utility. Two I have already written are html_fmt and html_score.
- A Marpa-powered tool could take a customized approach to dealing with defects in HTML.
- A Marpa-powered tool could take a configurable approach to dealing with defects in HTML.
- A Marpa-powered tool could provide an XS-powered engine to speed up HTML::Tree.
- A Marpa-powered tool could understand embedded content, such as the HTML-within-HTML used in the displays in this blog post.
- A Marpa-powered tool could imitate your favorite renderer.
- A Marpa-powered tool could configurably imitate the properties of many renderers.
- A Marpa-powered tool could prototype the HTML renderer of your dreams.
- With its HTML-specific layer recoded in C, a fork could be the easier-to-maintain HTML renderer of your dreams.
The Example
I will finish this post by taking a first look at what Marpa::HTML does. For that, it is convenient to use html_fmt, my Marpa-powered HTML "pretty-printer". html_fmt takes any file fed to it, interprets it as HTML, and prints a "prettified" version to the standard output. It's a good tool for studying HTML interpretation. Here is what it did with our example:
<!-- Following start tag is replacement for a missing one -->
<html>
<!-- Following start tag is replacement for a missing one -->
<head>
</head>
<!-- Preceding end tag is replacement for a missing one -->
<!-- Following start tag is replacement for a missing one -->
<body>
<!-- Following start tag is replacement for a missing one -->
<table>
<!-- Following start tag is replacement for a missing one -->
<tbody>
<!-- Following start tag is replacement for a missing one -->
<tr>
<!-- Next line is cruft -->
</td>
<td>
<div class="module">
<a href="http://www.perlfoundation.org/">
<img alt=""
src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
width="45" height="45" />
</a>
<h4>
<a href="http://www.perlfoundation.org/">
The Perl Foundation
</a>
</h4>
<p>
The Perl Foundation is dedicated to the advancement
of the Perl programming language through open discussion,
collaboration, design, and code.
</p>
<!-- Preceding end tag is replacement for a missing one -->
</div>
<!-- Preceding end tag is replacement for a missing one -->
</td>
<!-- Preceding end tag is replacement for a missing one -->
</tr>
<!-- Preceding end tag is replacement for a missing one -->
</tbody>
<!-- Preceding end tag is replacement for a missing one -->
</table>
<!-- Preceding end tag is replacement for a missing one -->
</body>
<!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->
Whenever html_fmt adds a tag or spots a spurious tag ("cruft") it adds a comment to that effect. As you can see from the above, html_fmt sees that the extract is a table fragment and builds a table around it. An interesting exercise is to take both the example and the html_fmt output and look at them in your favorite browser, comparing html_fmt's reconstruction from the HTML fragment with your browser's.
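To make the idea of building structure around a fragment concrete, here is a toy sketch in Python. It is not how Marpa::HTML or html_fmt is implemented: the REQUIRED_PARENT table, the fill_in function and the (kind, value) token format are all hypothetical, and only the wording of the comments is taken from the html_fmt output above. The sketch supplies missing ancestor start tags for table elements and closes whatever is still open at end of input.

```python
# Hypothetical, minimal containment rules: each element needs this parent.
REQUIRED_PARENT = {"td": "tr", "tr": "tbody", "tbody": "table"}

MISSING_START = "<!-- Following start tag is replacement for a missing one -->"
MISSING_END = "<!-- Preceding end tag is replacement for a missing one -->"


def fill_in(tokens):
    """Return output lines for a list of (kind, value) tokens.

    kind is "start", "end" or "text". Missing ancestor start tags and
    missing end tags are supplied, each flagged with a comment.
    """
    out, open_tags = [], []

    def open_element(tag, supplied):
        # Open any required ancestor first, so the element has a home.
        parent = REQUIRED_PARENT.get(tag)
        if parent is not None and parent not in open_tags:
            open_element(parent, supplied=True)
        if supplied:
            out.append(MISSING_START)
        out.append(f"<{tag}>")
        open_tags.append(tag)

    def close_supplied():
        out.append(f"</{open_tags.pop()}>")
        out.append(MISSING_END)

    for kind, value in tokens:
        if kind == "start":
            open_element(value, supplied=False)
        elif kind == "end" and value in open_tags:
            # Close inner elements whose own end tags never showed up.
            while open_tags[-1] != value:
                close_supplied()
            open_tags.pop()
            out.append(f"</{value}>")
        else:
            # Text, or an end tag with nothing to close: passed through
            # unchanged in this sketch.
            out.append(value if kind == "text" else f"</{value}>")

    # End of input: everything still open gets a supplied end tag.
    while open_tags:
        close_supplied()
    return out
```

Feeding it a lone start tag for td followed by some text yields a table, tbody and tr around the cell, each with the "replacement" comment. That is the same shape as the html_fmt output, minus html, head and body.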
In posts to come, I'll go into detail about the Ruby Slippers, Marpa-powered, approach to HTML parsing.
Notes
"forking": To be sure, I consider Marpa::HTML a fine tool, and use it often in my own work. And it is possible that I will enhance Marpa::HTML. But a parser like Marpa is a tool. Inventing a new hammer does not evince a desire to single-handedly build every house in the world.
If you'd like to add an example to http://htmlparsing.com/perl.html I'd welcome it.
If that corrected html output is copy/pasted, you have a bug to hunt down. Because the next fragment most definitely isn't correct.
@bart: I noticed the stray </td> tag and left it for future discussion. There are several approaches to dealing with a stray </td> tag. One is to discard it. A second is to match it with a start tag: a <td>. A third (the one implemented) is to leave it, adding a comment.
I made keeping all tags in the original a design goal of html_fmt, so that eliminated the possibility of discarding it. What about creating a start tag to match it? That would mean that every stray </td> potentially would bring a whole table into being in the new document. Arguably, that's the right approach, but the possibility of creating 10 new tags to compensate for one stray (which might be a typo) struck me as too much. If an empty table is not what the user wanted (and it almost certainly is not), the source of the problem would be buried in the middle of a cascade of unwanted newly created tags.
The solution chosen is to leave the stray </td> in the document as cruft. The user can search for the comments and apply the fix of his choice. Whether this is the best compromise is, I admit, open to debate.
There is no one way of dealing with defective HTML which is right for all situations. Ideally, a liberal HTML pretty-printer would come with a host of configuration options, or be targeted at a specific application.
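The three approaches can be put side by side as a configurable policy. What follows is a toy sketch in Python, not code from html_fmt: the function name, the policy names and the (kind, tag) token format are all hypothetical. Note also that the "add_start" branch here creates only the one matching start tag, where a real implementation would bring the whole surrounding table into being.

```python
CRUFT = "<!-- Next line is cruft -->"


def handle_stray_end_tags(tokens, policy="cruft"):
    """Return output lines for a list of ("start" | "end", tag) tokens.

    An end tag is "stray" when no element of that name is open.
    policy is one of "discard", "add_start" or "cruft".
    """
    if policy not in ("discard", "add_start", "cruft"):
        raise ValueError(f"unknown policy: {policy}")
    out, open_tags = [], []
    for kind, tag in tokens:
        if kind == "start":
            open_tags.append(tag)
            out.append(f"<{tag}>")
        elif tag in open_tags:
            # Matched end tag: pop back to (and including) its element.
            while open_tags.pop() != tag:
                pass
            out.append(f"</{tag}>")
        elif policy == "discard":
            continue  # first approach: drop the stray tag
        elif policy == "add_start":
            # Second approach: invent a start tag to pair with it.
            out.extend([f"<{tag}>", f"</{tag}>"])
        else:
            # Third approach: keep the tag and flag it with a comment.
            out.extend([CRUFT, f"</{tag}>"])
    return out
```

With the tokens for </td><td></td>, the "cruft" policy keeps all three tags and adds one comment, "discard" returns only the matched pair, and "add_start" returns two pairs.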
@petdance (Andy): Thanks. I am interested in writing a graf. I'll have to think out how to describe the relationship between Marpa::HTML and HTML::Parser. Marpa::HTML uses HTML::Parser (which is a wonderful module by the way), but calling Marpa::HTML a wrapper around HTML::Parser would be like saying yacc is a wrapper around lex.
HTML::Parser is a tokenizer -- it produces a string of tokens. Marpa::HTML finds the structure of an HTML document. In the more usual terminology, what Marpa::HTML does is called "parsing" and what HTML::Parser does is called "lexing".
Not to say that HTML::Parser's name is wrong -- the word "parsing" gets used in a lot of different ways. But it does make it difficult for me to describe Marpa::HTML and its relations to HTML::Parser. In the past, I've resorted to talking about "low-level" and "high-level parsing" which, while not incorrect, is neither standard terminology nor very descriptive.
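The lexing/parsing split may be easier to see in a small example. The sketch below is in Python and is only an analogy: the standard library's html.parser.HTMLParser stands in for the tokenizing role that HTML::Parser plays, and the parse function is a hypothetical toy that assumes well-formed input. It is not Marpa and does no repair. The names Tokenizer, lex and parse are made up for illustration.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Lexing layer: turns markup into a flat list of tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs
            self.tokens.append(("text", text))


def lex(markup):
    """Return the token list for a string of markup."""
    tokenizer = Tokenizer()
    tokenizer.feed(markup)
    tokenizer.close()
    return tokenizer.tokens


def parse(tokens):
    """Parsing layer: nest the flat tokens into a (name, children) tree.

    Toy version: assumes every start tag has a matching end tag.
    """
    root = ("root", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "start":
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == "end":
            stack.pop()
        else:
            stack[-1][1].append(value)
    return root
```

The output of lex is a string of tokens with no structure. The output of parse is the structure. Everything that makes liberal HTML hard lives in the second step.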
... or you could just use this:
http://search.cpan.org/~tobyink/Task-HTML5-0.103/lib/Task/HTML5/Examples/htmltidy.pod
@Toby: Looking quickly at Task::HTML5, I see that it reflects a lot of thought about and study of the standards. A quick benchmark (round-tripping the perl.org landing page) suggests that if you converted Task::HTML5's parser to work as described in this post, it would run more than twice as fast. I think it might also be easier to maintain in that form.
Interface issues are matters of taste, but I will note some of the differences. In the example in this post, htmltidy throws away all traces of the table, while Marpa::HTML builds the table out. I tested response to a very defective input (the www.perl.org page downloaded as a TEXT file instead of HTML), and Marpa::HTML turns it into a very ugly-looking HTML page, while Task::HTML5 produces no output and instead reports that the file contains a "bad name". The ideal solution would be to make behaviors like this configurable. I'd suggest that would be far easier to do using the Marpa-powered approach, in which case the gain in execution speed would come as a bonus.