A Configurable HTML Parser
[ This is cross-posted from the new home of the Ocean of Awareness blog. ]
This post introduces an HTML parser which is both liberal and configurable. Currently available as a part of a Marpa::R2 developer's release on CPAN, the new Marpa::R2::HTML allows detailed configuration of new tags and respecification of the behavior of existing tags.
To show how a configurable HTML parser works, I will start with a simple task. Let's suppose we have a new tag, call it <acme>. The older, non-configurable version of Marpa, and most browsers, would recognize this tag. But they'd simply give it a default configuration, one which is usually very liberal -- liberal meaning the tag is allowed to contain just about anything, and to go just about anywhere. A configurable parser allows us to specify the new tag's behavior more explicitly and strictly.
Block vs. inline
In this post I am going to assume that the reader knows, or will look up what he needs to know, about HTML. But block vs. inline is a essential HTML concept which is often ignored -- even the excellent Wikipedia page on HTML does not define "inline", although it uses the term in the technical sense twice. Since the concept is also central to this post, let me briefly summarize it.
Quoting from the HTML 4.01 Strict DTD,
HTML has two basic content models: %inline; character level elements and text strings %block; block-like elements e.g. paragraphs and lists
There is also what I will call a "mixed flow", which can contain anything that can appear in either a block or inline flow. (What I call a mixed flow is simply called a "flow" in the HTML 4.01 DTD.)
Significant in the examples are <p>, which is a block element that contains an inline flow and <span>, which is an inline element that contains an inline flow. The <body> tag is an important structural tag, which contains an block flow.
For simplicity I am following HTML 4.01 DTD. HTML 5 uses radically different terminology and is more liberal in what it allows. Differences among standards are an important reason for an HTML parser to be configurable.
Controlling element context
An inline element containing an inline flow
Let's define the <acme> tag to be an inline tag with inline contents. This is done by adding the following line to the Marpa::R2::HTML grammar configuration file:
ELE_acme is a FLO_inline included in GRP_inline
Let's take as our HTML, the following:
<acme>-during-<span>-more inline stuff-<p>-new block-
With the following shell commands, we add the new line to default.cfg, the default grammar configuration file. We then use the marpa_r2_html_fmt utility that comes with Marpa::R2 to parse the HTML.
cp default.cfg test.cfg echo "ELE_acme is a FLO_inline included in GRP_inline" >> test.cfg echo '<acme>-during-<span>-more inline stuff-<p>-new block-' | marpa_r2_html_fmt --no-added-tag --compile test.cfg
marpa_r2_html_fmt indents the HTML and adds missing tags This will show us the structure of our small HTML document. Here is what we get:
<html> <head> </head> <body> <p> <acme> -during-<span> -more inline stuff-</span></acme></p><p> -new block- </p></body> </html>
We see from this that the configuration took proper effect. Since the <acme> tag was configured as an inline element, it cannot go directly inside the the <body>. So a new <p> is created to contain the <acme> element. (A Marpa::R2::HTML configuration can also change the contents that are acceptable inside the <body>. For the moment, we'll keep it simple and accept as given that <body> contains a block flow.)
Similarly, since the <acme> tag contains an inline flow, it must end before the next <p> tag. The parser supplies an end tag for the <acme> element, and also closes the <p> paragraph that was started to hold the <acme> element.
A block element containing an inline flow
Our new configuration line can also specify that <acme> is a block element:
ELE_acme is a FLO_inline included in GRP_block
The code to test this is very similar to that displayed above. It, and the scripts for all of this post's other examples, can be found as a gist. Here's the result:
<html> <head> </head> <body> <acme> -during-<span> -more inline stuff-</span></acme><p> -new block- </p></body> </html>
Here the <acme> element is allowed in a block, so no new <p> element was needed to contain it. Since the <acme> element's contents are inline, the <acme> element still needed to be ended before the <p> tag in the actual input.
The grammar configuration file
Those who click through to look at the grammar configuration file may notice its length. Three pages. And almost half of those three pages are single-line descriptions of elements which are one or more of deprecated, obsolete or proprietary. Arguably the configuration file should be even shorter.
Other implementations of liberal HTML parsers spread the logic specifying the HTML variant across a considerably larger body of code, sometimes a vastly larger. This very much affects the cost of evolving and maintain the parser.
As for the configuration file's format at the moment: it is experimental. I can state from experience that it is quite serviceable, and fairly readable, but it could be more elegant.
Controlling element content
A block element containing a mixed flow
We can also configure the contents of the <acme> element. This configuration line specifies a mixed flow.
ELE_acme is a FLO_mixed included in GRP_block
And here is what we get:
<html> <head> </head> <body> <acme> -during-<span> -more inline stuff-</span><p> -new block- </p></acme></body> </html>
A mixed flow accepts any contents, so that the <acme> element's contents expand to include the entire body of the html document.
A block element containing a block flow
With this configuration line, we request a block flow for the contents:
ELE_acme is a FLO_block included in GRP_block
The results:
<html> <head> </head> <body> <acme> <p> -during-<span> -more inline stuff-</span></p><p> -new block- </p></acme></body> </html>
Here the <acme> element also spans the entire body of the HTML element, but because block flows are less liberal than mixed flows, the contents of the <acme> element have to be properly "packaged" inside block elements.
A block element containing PCDATA
We are getting progressively more restrictive with the contents of the <acme> element. We've already seen the example of an <acme> block element for inline contents. This configuration line specifies that the contents must be PCDATA. PCDATA allows text, but not child elements.
ELE_acme is a FLO_pcdata included in GRP_block
The result:
<html> <head> </head> <body> <acme> -during-</acme><p> <span> -more inline stuff-</span></p><p> -new block- </p></body> </html>
Note here that the <acme> tag ends as soon as another element is encountered.
An empty block element
The final restriction on the <acme> element is the insistance that it be empty:
ELE_acme is a FLO_empty included in GRP_block
And here is our result:
<html> <head> </head> <body> <acme> </acme><p> -during-<span> -more inline stuff-</span></p><p> -new block- </p></body> </html>
Displayed effects
Any of these different configurations of the <acme> tag could have a dramatic effect on what is displayed, depending on your CSS file. Whether or not <acme> is a block element also affects what is displayed. When <acme> is a block element, its boundaries will typically display as paragraph breaks.
In the above examples, the case where <acme> is configured as a block element containing PCDATA typically displays as three paragraphs:
-during- -more inline stuff- -new block-
In the other cases, the end boundary of the <acme> element varies, but always coincides with the beginning or end of other block elements, so that the visible display is as two paragraphs:
-during- -more inline stuff- -new block-
Final remarks
The configurable Marpa::R2::HTML does considerably more than could be mentioned in this post. I hope to say more about it soon. Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com
More interesting to me personally would be an implementation of the HTML 5 algorithm on top of Marpa. (As far as I know, it’s quite a bit more involved than just when to insert implied tags, though that is obviously a core aspect.) I may do that myself someday if no one beats me to it (which I always hope for). The distinct advantage of this algorithm over any other approach to parsing tag soup is that it is the way browsers will actually parse the markup. Any different algorithm for parsing tag soup is interesting to me an intellectual exercise – but not as a practical tool.
(This is not a disparagement of your work – rather an expression of frustration because I find it exciting on one hand but then also feel I’ll just not be able to use it… such a pity.)
Marpa::R2::HTML configurability extends to far more than adding new tags. It allows the re-purposing of existing ones, including the contents of <head> and <body>, two of the 3 structural tags. I'm about to upload a new blog post (probably this evening) that describes how to change the contents of the <body> tag -- something that HTML 5 requires.
When I started writing this series on HTML, I started simple to make it easier on the reader. I knew this risked giving the impression the simple stuff is all there is. Far from it.
Are there plans to make Marpa-powered backends for popular modules like HTML::TreeBuilder, and if so, how would it compare to HTML::TreeBuilder::LibXML?
@chimerix: No plans that are known to me. But with Marpa::R2 a lot of work has been done recently on portability, removing a real obstacle. In particular, Marpa::R2 now ports to Windows, which means that Marpa-powered modules can now expect to run on Windows.
If you're interested in following up, consider the Marpa Google group, marpa-parser@googlegroups.com, which is at this point Marpa's main forum.
[Note: my apologies for the delay here. By accident I left this reply in the pending queue for a month. Sorry -- jeffrey ]