An HTML Pretty-printer

It's nice to get a big project to the point where it produces something that is actually useful. I'm pleased to announce html_fmt, an HTML pretty-printer that's part of the Marpa::HTML distribution.

The command

    html_fmt http://perl.org

will pretty-print the HTML for http://perl.org. Tags are printed out one per line, indented according to structure. html_fmt supplies any missing start or end tags, adding comments to that effect. html_fmt respects <pre> tags.

If the argument is not a URI, it's interpreted as a file name. Suppose, for example, that very_bad_html is a file containing this HTML: "<tr>cell data". Then html_fmt very_bad_html will convert it into this:

    <!-- Following start tag is replacement for a missing one -->
    <html>
      <!-- Following start tag is replacement for a missing one -->
      <head>
      </head>
      <!-- Preceding end tag is replacement for a missing one -->
      <!-- Following start tag is replacement for a missing one -->
      <body>
        <!-- Following start tag is replacement for a missing one -->
        <table>
          <!-- Following start tag is replacement for a missing one -->
          <tbody>
            <tr>
              <!-- Following start tag is replacement for a missing one -->
              <td>
                cell data
              </td>
              <!-- Preceding end tag is replacement for a missing one -->
            </tr>
            <!-- Preceding end tag is replacement for a missing one -->
          </tbody>
          <!-- Preceding end tag is replacement for a missing one -->
        </table>
        <!-- Preceding end tag is replacement for a missing one -->
      </body>
      <!-- Preceding end tag is replacement for a missing one -->
    </html>
    <!-- Preceding end tag is replacement for a missing one -->
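
The URI-or-file-name rule can be pictured with a small shell function. This is only a sketch: the name classify_arg and the list of schemes are assumptions made for illustration, and html_fmt's own test for what counts as a URI may well differ.

```shell
# Hypothetical helper, not part of html_fmt: report whether an
# argument would be treated as a URI or as a file name.
# Assumption: only these three schemes count as URIs.
classify_arg() {
    case "$1" in
        http://*|https://*|ftp://*) echo uri ;;
        *) echo file ;;
    esac
}

classify_arg http://perl.org    # prints "uri"
classify_arg very_bad_html      # prints "file"
```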

html_fmt seeks to be as liberal as any HTML rendering engine you'll ever encounter. It treats all files as HTML. Occasionally even html_fmt's aggressive liberalizations of the syntax cannot make a document parse as HTML. When that happens, html_fmt tags unparseable sections of the document as "cruft". Cruft is included in the output, but is ignored for the purpose of determining the document's structure. An identifying comment is added after the cruft.

html_fmt was intended to allow easy reading of HTML, and as a diagnostic tool. It was not originally intended for reformatting. But the prettified output of html_fmt will usually render the same as the input, modulo some extra spacing, particularly around end tags. This blog post, for example, was run through an html_fmt command. A future release may reduce or eliminate spacing changes.
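
As one possible diagnostic use, the comments html_fmt adds can be counted to get a rough measure of how many tags a page was missing. The sketch below assumes the comment wording shown in the example output earlier in this post. The function name count_supplied_tags is made up for illustration and is not part of html_fmt.

```shell
# Hypothetical helper: read html_fmt output on standard input and
# count the comments that mark supplied start and end tags.
# Assumption: the comment text matches the example output in this post.
count_supplied_tags() {
    awk '
        /<!-- Following start tag is replacement for a missing one -->/ { starts++ }
        /<!-- Preceding end tag is replacement for a missing one -->/   { ends++ }
        END {
            printf "start tags supplied: %d\n", starts
            printf "end tags supplied: %d\n", ends
        }
    '
}
```

With html_fmt on the PATH, the command html_fmt very_bad_html | count_supplied_tags should report 6 supplied start tags and 7 supplied end tags for the "<tr>cell data" example.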

html_fmt uses WWW::Mechanize to fetch URIs and HTML::Parser to do the low-level HTML parsing. The high-level structure of the HTML is determined by Marpa::HTML.

I developed html_fmt as a test of Marpa. What are the advantages of using Marpa? Other high-level HTML/XHTML/XML parsers use regular expressions and/or ad-hoc methods. In some implementations these are fast, but they are not easy to maintain. Marpa is a general BNF parser generator, and Marpa::HTML is based on a BNF representation of a liberalized HTML. That makes it easy enough to change if you prefer a different liberalization of the XHTML/HTML standards.

A few additional parsing tricks are added for efficiency. These also are intuitive and are driven by tables that are straightforward to modify. In future blog posts I hope to explain in detail how Marpa::HTML parses HTML.

html_fmt's current speed is quite acceptable for looking at individual web pages, but users who need to crunch a large database of HTML may want more. Right now Marpa exists only as a "Pure Perl" implementation. I've started on the XS version, and when that is complete I expect speeds that are not equal to those of rendering engines custom-coded in C, but are in the same ballpark.

About Jeffrey Kegler

I blog about Marpa, my parsing algorithm, and other things of interest to techies.