A configurable HTML parser, part 2

By Jeffrey Kegler on October 16, 2012 9:35 PM

[ I have succumbed to the sirens of static blogging. This is cross-posted from the new home of the Ocean of Awareness blog. ]

My last post introduced Marpa::R2::HTML, a configurable HTML parser. By editing a configuration file, the user can change the variant of HTML being parsed. The changes allowed are very wide ranging. The previous post started with simple changes -- the ability to specify the contents of new tags, and the context in which they can appear.

In this post the changes get more aggressive. I change the contents of an existing HTML element -- and not just any element, but one of HTML's three "structural" elements. Marpa::R2::HTML allows the configuration file to change the contents of all pre-existing elements, with the exception of the highest level of the three structural elements: the <html> element itself.

Can text appear directly in an HTML body?

This post will discuss changing the contents of the <body> element. Fundamental to the HTML document as this element is, the definition of its contents has been very much in play.

Let's start with the question posed in the title of this section: Can text appear directly in an HTML <body> element? That is, must text inside an HTML <body> be part of one of its child elements, or can it be directly part of the contents of the <body> element?

If you want an answer strictly according to the standards, then you get your choice in the matter. According to the HTML 4.01 Strict DTD, the <body> contains a "block flow", which means that the answer is "No, text must be in the contents of a child element". Implementations of HTML were encouraged to be liberal, however, and in practice a lot of the HTML "out there" has text directly in <body> elements. Users expect their browsers to render these pages in the way that the writer intended them to look.

Recognizing existing practice, HTML 5 requires conforming implementations to allow text to be interspersed with the block flow, in what I call a "mixed flow". A mixed flow can directly contain blocks and text, as well as inline elements. (The inline vs. block element distinction is basic to HTML parsing. See my earlier post or the well-organized Wikipedia page on HTML elements.)
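To make the distinction concrete, here is a minimal Python sketch of the two content models. It is not how Marpa::R2::HTML represents them: the element tables are small hand-picked subsets, the names `classify` and `allowed_in` are hypothetical, and the sketch ignores details such as whitespace-only text.

```python
# Hand-picked subsets for illustration only; not the full HTML element lists.
BLOCK_ELEMENTS = {"p", "div", "ul", "ol", "table", "blockquote", "pre", "h1", "h2"}
INLINE_ELEMENTS = {"a", "em", "strong", "span", "code"}


def classify(item):
    """Return 'block', 'inline', or 'text' for one direct child of <body>.

    A child is written as '<name>' for an element, or as any other
    string for text.
    """
    if item.startswith("<") and item.endswith(">"):
        name = item[1:-1].lower()
        if name in BLOCK_ELEMENTS:
            return "block"
        if name in INLINE_ELEMENTS:
            return "inline"
        raise ValueError(f"element not in the toy tables: {name}")
    return "text"


def allowed_in(flow, children):
    """True if every direct child is permitted by the given flow."""
    permitted = {
        "block": {"block"},                    # block elements only
        "mixed": {"block", "inline", "text"},  # blocks, inlines, and text
    }[flow]
    return all(classify(child) in permitted for child in children)


children = ["I cannot wait for a start tag", "<p>"]
print(allowed_in("block", children))  # False
print(allowed_in("mixed", children))  # True
```

Under the block flow the leading text is rejected as a direct child of the body; under the mixed flow the same sequence is accepted as written.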

Block or mixed flow?

When parsing HTML, do you want to treat the contents of the body as a block flow or a mixed flow? Here are some of the factors.

Common practice requires accepting a mixed flow.
Cautious practice suggests writing a block flow.
HTML 4.01 requires block, but suggests being liberal.
HTML 5 requires that a mixed flow be accepted.
But HTML 5 also requires that the mixed flow be displayed as if it were written in blocks and suggests that explicit blocking be used to eliminate ambiguities.

Examples

Body contains block flow

In this first example, the <body> contains a block flow. This is what is specified in the default configuration file. Here is the pertinent line:


<body> is *block

This line says that the <body> element contains a block flow (*block). Here the star is a sigil which suggests the repetition operator of DTDs and regular expressions. (Readers of my last post will notice I've changed the configuration file syntax and will, I hope, find the new format an improvement.)

For the examples in this post, the HTML will be


I cannot wait for a start tag<p>I can

We run this through marpa_r2_html_fmt --no-added-tag-comment. Here is the output:


<html>
  <head>
  </head>
  <body>
    <p>
      I cannot wait for a start tag</p><p>
      I can</p></body>
</html>

The first thing the parser encounters is text, which in this example is not allowed to occur directly in the body. As part of being a highly liberal HTML parser, however, Marpa::R2::HTML will supply a start tag in these situations. (This behavior, by the way, is also configurable -- a change to the configuration file can tell Marpa::R2::HTML not to do this.) With its two <p> start tags, one of them conjured up by the Ruby Slippers, Marpa::R2::HTML breezes through its input.

Body contains mixed flow

In the second example, we liberalize the contents of the <body> to allow a mixed flow:


<body> is *mixed

Here is the result:


<html>
  <head>
  </head>
  <body>
    I cannot wait for a start tag<p>
      I can</p></body>
</html>

In a mixed flow, the leading text can sit directly in the <body>, so no second <p> start tag is needed, and none is created. Its matching end tag (</p>) does not have to be created either. Otherwise, all is as before.
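The difference between the two results can be imitated with a toy Python sketch. This is emphatically not Marpa's algorithm or the Marpa::R2::HTML API: it is a hand-written loop that handles only text and bare <p> start tags, it leaves out the <html>, <head> and <body> wrappers, and the names `tokenize` and `supply_tags` are hypothetical. It only shows where a start tag gets supplied when the body is a block flow, and why none is needed when the body is a mixed flow.

```python
import re


def tokenize(html):
    """Split input into ('start', name) and ('text', data) tokens.

    Toy tokenizer: it recognizes bare start tags such as <p> and
    treats everything else as text.
    """
    tokens = []
    for part in re.split(r"(<[a-zA-Z][^>]*>)", html):
        if not part:
            continue
        if part.startswith("<"):
            tokens.append(("start", part[1:-1].lower()))
        else:
            tokens.append(("text", part))
    return tokens


def supply_tags(html, body_flow):
    """Return the body contents with missing <p> tags filled in.

    body_flow is 'block' (text may not sit directly in the body) or
    'mixed' (text may sit directly in the body).
    """
    if body_flow not in ("block", "mixed"):
        raise ValueError(f"unknown flow: {body_flow}")
    out = []
    in_p = False
    for kind, value in tokenize(html):
        if kind == "start" and value == "p":
            if in_p:
                out.append("</p>")  # a new <p> ends the one still open
            out.append("<p>")
            in_p = True
        elif kind == "text":
            if not in_p and body_flow == "block":
                out.append("<p>")  # start tag supplied for stray text
                in_p = True
            out.append(value)
        else:
            raise ValueError("this toy handles only text and <p>")
    if in_p:
        out.append("</p>")  # end tag supplied at end of input
    return "".join(out)


sample = "I cannot wait for a start tag<p>I can"
print(supply_tags(sample, "block"))
# <p>I cannot wait for a start tag</p><p>I can</p>
print(supply_tags(sample, "mixed"))
# I cannot wait for a start tag<p>I can</p>
```

The only line that depends on the flow is the one that supplies a start tag for stray text, which mirrors how a one-line change in the configuration file produces the two different outputs above.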

What I decided

Before I made my HTML parser configurable, I was forced to decide the issue of <body> contents one way or the other. My first implementation of the html_fmt utility was based on Marpa::XS and its grammar specified a mixed flow.

When I started a new version of the utility based on Marpa::R2, I reopened the issue. I decided that a stricter grammar produced a more precise parse, and that it was best to leave it up to the Ruby Slippers to "loosen things up" when the grammar was too strict. This was close, I hoped, to the best of both worlds. So I changed the grammar to specify a block flow for the contents of the <body> element. This second choice -- strict block-flow-body grammar and liberal Ruby Slippers -- remains the default in the configurable version.

In current developer's releases of Marpa::R2, and in its next indexed release, both the grammar and the Ruby Slippers are configurable. The true best of both worlds happens when the user gets to decide.

Code and comments

The examples here were run using Marpa::R2 release 2.021_010. They are part of its test suite and can be found in the html/t/cfg_fmt.t file.

The configurable Marpa::R2::HTML does considerably more than can be comfortably described in a single post. This post is the second of a series. Comments on this post can be sent to the Marpa Google Group: marpa-parser@googlegroups.com

Tagged as:

HTML, Marpa, parser

1 Comment

Toby Inkster | October 16, 2012 11:48 PM

Saying "HTML 4.01 requires block" is not 100% correct. The HTML 4.01 spec defines three different flavours of HTML: strict, transitional and frameset. The frameset document type definition is just the same as transitional, with a toggle enabled to allow frame-related elements to appear, so for the purposes of this discussion can be ignored.

While the strict document type definition does require body content to be enclosed in block elements, the transitional DTD does not. It defines the body content as "flow", meaning both block elements and inline elements are allowed. (Incidentally the flow content model is also used in various parts of the strict DTD, just not for the body element. It's used as the content model for list items, table cells and divs for instance.)

Interestingly, the form element has the same content model as body. In strict, it requires its contents to be packaged in block elements; in transitional it allows flow content.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.
