A Configurable HTML Parser

[ This is cross-posted from the new home of the Ocean of Awareness blog. ]

This post introduces an HTML parser which is both liberal and configurable. Currently available as a part of a Marpa::R2 developer's release on CPAN, the new Marpa::R2::HTML allows detailed configuration of new tags and respecification of the behavior of existing tags.

To show how a configurable HTML parser works, I will start with a simple task. Let's suppose we have a new tag, call it <acme>. The older, non-configurable version of Marpa, and most browsers, would recognize this tag. But they'd simply give it a default configuration, one which is usually very liberal -- liberal meaning the tag is allowed to contain just about anything, and to go just about anywhere. A configurable parser allows us to specify the new tag's behavior more explicitly and strictly.

Block vs. inline

In this post I am going to assume that the reader knows, or will look up what he needs to know, about HTML. But block vs. inline is a essential HTML concept which is often ignored -- even the excellent Wikipedia page on HTML does not define "inline", although it uses the term in the technical sense twice. Since the concept is also central to this post, let me briefly summarize it.

Quoting from the HTML 4.01 Strict DTD,

      HTML has two basic content models:

        %inline;     character level elements and text strings
        %block;      block-like elements e.g. paragraphs and lists

There is also what I will call a "mixed flow", which can contain anything that can appear in either a block or inline flow. (What I call a mixed flow is simply called a "flow" in the HTML 4.01 DTD.)

Significant in the examples are <p>, which is a block element that contains an inline flow and <span>, which is an inline element that contains an inline flow. The <body> tag is an important structural tag, which contains an block flow.

For simplicity I am following HTML 4.01 DTD. HTML 5 uses radically different terminology and is more liberal in what it allows. Differences among standards are an important reason for an HTML parser to be configurable.

Controlling element context

An inline element containing an inline flow

Let's define the <acme> tag to be an inline tag with inline contents. This is done by adding the following line to the Marpa::R2::HTML grammar configuration file:

ELE_acme is a FLO_inline included in GRP_inline

Let's take as our HTML, the following:

<acme>-during-<span>-more inline stuff-<p>-new block-

With the following shell commands, we add the new line to default.cfg, the default grammar configuration file. We then use the marpa_r2_html_fmt utility that comes with Marpa::R2 to parse the HTML.

cp default.cfg test.cfg
echo "ELE_acme is a FLO_inline included in GRP_inline" >> test.cfg
echo '<acme>-during-<span>-more inline stuff-<p>-new block-' |
  marpa_r2_html_fmt --no-added-tag --compile test.cfg

marpa_r2_html_fmt indents the HTML and adds missing tags This will show us the structure of our small HTML document. Here is what we get:

<html>
  <head>
  </head>
  <body>
    <p>
      <acme>
        -during-<span>
          -more inline stuff-</span></acme></p><p>
      -new block-
    </p></body>
</html>

We see from this that the configuration took proper effect. Since the <acme> tag was configured as an inline element, it cannot go directly inside the the <body>. So a new <p> is created to contain the <acme> element. (A Marpa::R2::HTML configuration can also change the contents that are acceptable inside the <body>. For the moment, we'll keep it simple and accept as given that <body> contains a block flow.)

Similarly, since the <acme> tag contains an inline flow, it must end before the next <p> tag. The parser supplies an end tag for the <acme> element, and also closes the <p> paragraph that was started to hold the <acme> element.

A block element containing an inline flow

Our new configuration line can also specify that <acme> is a block element:

ELE_acme is a FLO_inline included in GRP_block

The code to test this is very similar to that displayed above. It, and the scripts for all of this post's other examples, can be found as a gist. Here's the result:

<html>
  <head>
  </head>
  <body>
    <acme>
      -during-<span>
        -more inline stuff-</span></acme><p>
      -new block-
    </p></body>
</html>

Here the <acme> element is allowed in a block, so no new <p> element was needed to contain it. Since the <acme> element's contents are inline, the <acme> element still needed to be ended before the <p> tag in the actual input.

The grammar configuration file

Those who click through to look at the grammar configuration file may notice its length. Three pages. And almost half of those three pages are single-line descriptions of elements which are one or more of deprecated, obsolete or proprietary. Arguably the configuration file should be even shorter.

Other implementations of liberal HTML parsers spread the logic specifying the HTML variant across a considerably larger body of code, sometimes a vastly larger. This very much affects the cost of evolving and maintain the parser.

As for the configuration file's format at the moment: it is experimental. I can state from experience that it is quite serviceable, and fairly readable, but it could be more elegant.

Controlling element content

A block element containing a mixed flow

We can also configure the contents of the <acme> element. This configuration line specifies a mixed flow.

ELE_acme is a FLO_mixed included in GRP_block

And here is what we get:

<html>
  <head>
  </head>
  <body>
    <acme>
      -during-<span>
        -more inline stuff-</span><p>
        -new block-
      </p></acme></body>
</html>

A mixed flow accepts any contents, so that the <acme> element's contents expand to include the entire body of the html document.

A block element containing a block flow

With this configuration line, we request a block flow for the contents:

ELE_acme is a FLO_block included in GRP_block

The results:

<html>
  <head>
  </head>
  <body>
    <acme>
      <p>
        -during-<span>
          -more inline stuff-</span></p><p>
        -new block-
      </p></acme></body>
</html>

Here the <acme> element also spans the entire body of the HTML element, but because block flows are less liberal than mixed flows, the contents of the <acme> element have to be properly "packaged" inside block elements.

A block element containing PCDATA

We are getting progressively more restrictive with the contents of the <acme> element. We've already seen the example of an <acme> block element for inline contents. This configuration line specifies that the contents must be PCDATA. PCDATA allows text, but not child elements.

ELE_acme is a FLO_pcdata included in GRP_block

The result:

<html>
  <head>
  </head>
  <body>
    <acme>
      -during-</acme><p>
      <span>
        -more inline stuff-</span></p><p>
      -new block-
    </p></body>
</html>

Note here that the <acme> tag ends as soon as another element is encountered.

An empty block element

The final restriction on the <acme> element is the insistance that it be empty:

ELE_acme is a FLO_empty included in GRP_block

And here is our result:

<html>
  <head>
  </head>
  <body>
    <acme>
    </acme><p>
      -during-<span>
        -more inline stuff-</span></p><p>
      -new block-
    </p></body>
</html>

Displayed effects

Any of these different configurations of the <acme> tag could have a dramatic effect on what is displayed, depending on your CSS file. Whether or not <acme> is a block element also affects what is displayed. When <acme> is a block element, its boundaries will typically display as paragraph breaks.

In the above examples, the case where <acme> is configured as a block element containing PCDATA typically displays as three paragraphs:

-during-

-more inline stuff-

-new block-

In the other cases, the end boundary of the <acme> element varies, but always coincides with the beginning or end of other block elements, so that the visible display is as two paragraphs:

-during- -more inline stuff-

-new block-

Final remarks

The configurable Marpa::R2::HTML does considerably more than could be mentioned in this post. I hope to say more about it soon. Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

4 Comments

More interesting to me personally would be an implementation of the HTML 5 algorithm on top of Marpa. (As far as I know, it’s quite a bit more involved than just when to insert implied tags, though that is obviously a core aspect.) I may do that myself someday if no one beats me to it (which I always hope for). The distinct advantage of this algorithm over any other approach to parsing tag soup is that it is the way browsers will actually parse the markup. Any different algorithm for parsing tag soup is interesting to me an intellectual exercise – but not as a practical tool.

(This is not a disparagement of your work – rather an expression of frustration because I find it exciting on one hand but then also feel I’ll just not be able to use it… such a pity.)

Are there plans to make Marpa-powered backends for popular modules like HTML::TreeBuilder, and if so, how would it compare to HTML::TreeBuilder::LibXML?

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.