How to Parse HTML

This is the first of a series of posts that will detail a Marpa-based "Ruby Slippers" approach to parsing liberal and defective HTML. As an example, let's look at a few lines taken more or less at random from the middle of the perl.org landing page. That page is exactly 400 lines long. Here is line 200 and some lines to either side of it.


</td>
<td>
<div class="module">
<a href="http://www.perlfoundation.org/">
<img alt=""
    src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
    width="45" height="45" />
</a>

<h4>
<a href="http://www.perlfoundation.org/">The Perl Foundation</a>
</h4>
<p>
The Perl Foundation is dedicated to the advancement
of the Perl programming language through open discussion,
collaboration, design, and code.

For readability, I've rearranged the whitespace, but otherwise the above is untouched. My more-or-less-random extract is part of a table, and captures the end tag of one cell and the beginning of another.

Marpa::HTML has no trouble fixing this up, but neither does the rendering engine in Firefox. If you cut and paste the above display into a file, and look at it in your favorite web browser, the result will probably be quite reasonable. So why am I saying that Marpa-based HTML parsing is a step forward?

What's the Big Deal?

Marpa::HTML as it sits is a flexible and usable tool, but what makes it different is best appreciated if you think in terms of writing, maintaining or forking Marpa::HTML. The rendering engine in your favorite browser is a monolith of mission-specific code, and in most cases maintaining it is the job of a team. Marpa::HTML was written in a couple of geek-weeks. It is reasonable to expect that even if a change amounted nearly to a total rewrite, it could be done in a similar amount of time.

Why is Marpa::HTML that much easier to code? Well, admittedly, Marpa::HTML does not do rendering, which makes life easier. Marpa::HTML just parses HTML, without trying to figure out how to arrange it on a display.

More to our purpose, Marpa::HTML divides HTML parsing into two layers: an HTML-specific layer and a general parsing layer. Most of the complexity goes into the general parsing layer, which is in carefully optimized C, and which contains no HTML-specific code. The HTML-specific layer is small and coded in Perl.
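To make the split concrete, here is a toy sketch in Python. It is not Marpa's algorithm and not Marpa::HTML's code: the function `supply_missing_parents` and the table `HTML_PARENTS` are hypothetical names invented for this illustration, and the table covers only a handful of elements. The point is only the shape of the design: the engine handles abstract symbols and a table, and everything that is specific to HTML lives in the table.

```python
# Generic layer: knows nothing about HTML. It sees only "symbols" and
# a table saying which parent each symbol requires.
def supply_missing_parents(symbols, required_parent):
    """Return the symbols with any missing ancestors inserted.

    Each output item is (symbol, supplied), where supplied is True
    for symbols the engine had to invent.
    """
    stack, out = [], []

    def open_symbol(symbol, supplied):
        parent = required_parent.get(symbol)
        if parent is not None and parent not in stack:
            open_symbol(parent, True)  # invent the missing ancestor first
        stack.append(symbol)
        out.append((symbol, supplied))

    for symbol in symbols:
        open_symbol(symbol, False)
    return out


# HTML-specific layer: a small table (an illustrative subset only).
HTML_PARENTS = {
    "td": "tr",
    "tr": "tbody",
    "tbody": "table",
    "table": "body",
    "body": "html",
}

for tag, supplied in supply_missing_parents(["td", "div"], HTML_PARENTS):
    note = "  <!-- supplied -->" if supplied else ""
    print(f"<{tag}>{note}")
```

Changing how HTML is treated means editing the table, not the engine. A real general-purpose parser does far more than this stack walk, but the division of labor is the same idea.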

When I started writing Marpa::HTML, my general parsing layer was already written, tested and ready to go. I only needed to write an HTML-specific layer. Marpa::HTML was my first serious exercise of the Ruby Slippers, and my surprise at how easy it was to use inspired the name.

The advantages of breaking up an HTML parser into an HTML-specific layer and a parse engine can be compared to the advantages gained by breaking up the original monolithic web browsers into a user interface and a rendering engine. The interface of modern browsers can be changed without hacking the rendering engine. The Marpa-powered approach to parsing HTML allows the programmer to completely change his approach to HTML without touching the parse engine.

As you read this series of posts, I hope the following will be food for the imagination:
  • Marpa::HTML could be the basis of a utility. Two I have already written are html_fmt and html_score.
  • A Marpa-powered tool could take a customized approach to dealing with defects in HTML.
  • A Marpa-powered tool could take a configurable approach to dealing with defects in HTML.
  • A Marpa-powered tool could provide an XS-powered engine to speed up HTML::Tree.
  • A Marpa-powered tool could understand embedded content, such as the HTML-within-HTML used in the displays in this blog post.
  • A Marpa-powered tool could imitate your favorite renderer.
  • A Marpa-powered tool could configurably imitate the properties of many renderers.
  • A Marpa-powered tool could prototype the HTML renderer of your dreams.
  • With its HTML-specific layer recoded in C, a fork could be the easier-to-maintain HTML renderer of your dreams.

The Example

I will finish this post by taking a first look at what Marpa::HTML does. For that, it is convenient to use html_fmt, my Marpa-powered HTML "pretty-printer". html_fmt takes any file fed to it, interprets it as HTML, and prints a "prettified" version to the standard output. It's a good tool for studying HTML interpretation. Here is what it did with our example:


<!-- Following start tag is replacement for a missing one -->

<html>
  <!-- Following start tag is replacement for a missing one -->
  <head>
  </head>
  <!-- Preceding end tag is replacement for a missing one -->
  <!-- Following start tag is replacement for a missing one -->

  <body>
    <!-- Following start tag is replacement for a missing one -->
    <table>
      <!-- Following start tag is replacement for a missing one -->
      <tbody>
        <!-- Following start tag is replacement for a missing one -->

        <tr>
          <!-- Next line is cruft -->
          </td>
          <td>
            <div class="module">
              <a href="http://www.perlfoundation.org/">

                <img alt=""
    src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
    width="45" height="45" />
              </a>
              <h4>
                <a href="http://www.perlfoundation.org/">
                  The Perl Foundation
                </a>
              </h4>

              <p>
                The Perl Foundation is dedicated to the advancement
                of the Perl programming language through open discussion,
                collaboration, design, and code.
              </p>
              <!-- Preceding end tag is replacement for a missing one -->
            </div>
            <!-- Preceding end tag is replacement for a missing one -->
          </td>

          <!-- Preceding end tag is replacement for a missing one -->
        </tr>
        <!-- Preceding end tag is replacement for a missing one -->
      </tbody>
      <!-- Preceding end tag is replacement for a missing one -->
    </table>

    <!-- Preceding end tag is replacement for a missing one -->
  </body>
  <!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->

Whenever html_fmt adds a tag or spots a spurious tag ("cruft"), it adds a comment to that effect. As you can see from the above, html_fmt sees that the extract is a table fragment and builds a table around it. An interesting exercise is to take both the example and the html_fmt output and look at them in your favorite browser, comparing html_fmt's reconstruction from the HTML fragment with your browser's.
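For readers who want to experiment with this style of annotation, here is a minimal Python sketch built on the standard library's `html.parser`. It is a simple stack-based tag balancer, not Marpa and not how html_fmt works: the class `Repairer`, the function `repair` and the `VOID` set are hypothetical names made up for this example. It flags stray end tags as cruft and supplies missing end tags, borrowing the comment wording from the display above. Unlike html_fmt, it makes no attempt to supply missing start tags.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (a small illustrative subset).
VOID = {"img", "br", "hr", "input", "meta", "link"}


class Repairer(HTMLParser):
    """Toy tag balancer: flags stray end tags, supplies missing ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.out = []        # lines of annotated output

    def emit(self, text):
        self.out.append("  " * len(self.open_tags) + text)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... />: nothing to push.
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            self.emit("<!-- Next line is cruft -->")
            self.emit(f"</{tag}>")
            return
        # Close anything still open inside the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.emit(f"</{top}>")
            if top == tag:
                break
            self.emit("<!-- Preceding end tag is replacement for a missing one -->")

    def handle_data(self, data):
        if data.strip():
            self.emit(" ".join(data.split()))

    def close(self):
        super().close()  # flush any buffered text first
        while self.open_tags:
            top = self.open_tags.pop()
            self.emit(f"</{top}>")
            self.emit("<!-- Preceding end tag is replacement for a missing one -->")


def repair(fragment):
    """Return an annotated, indented rendering of an HTML fragment."""
    parser = Repairer()
    parser.feed(fragment)
    parser.close()
    return "\n".join(parser.out)


print(repair("</td><td><div><p>text"))
```

Feeding it the extract from the top of this post gives something to compare against the html_fmt display: the end-tag repairs should look familiar, while the missing table structure around the cell will not be rebuilt.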

In posts to come, I'll go into detail about the Marpa-powered Ruby Slippers approach to HTML parsing.

Notes

  1. "forking": To be sure, I consider Marpa::HTML a fine tool, and use it often in my own work. And it is possible that I will enhance Marpa::HTML. But a parser like Marpa is a tool. Inventing a new hammer does not evince a desire to single-handedly build every house in the world.

6 Comments

If you'd like to add an example to http://htmlparsing.com/perl.html I'd welcome it.

If that corrected html output is copy/pasted, you have a bug to hunt down. Because the next fragment most definitely isn't correct.

        <tr>
          <!-- Next line is cruft -->
          </td>

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.