High-level HTML Parsing

Marpa::HTML is a high-level HTML parser, built on top of the very high-quality HTML::Parser module. Why bother with high-level parsing, especially if it means layering one parser on top of another?

Here is an example, taken from the main document for HTML::Parser. The example prints out the title of an HTML document. To do this, HTML::Parser uses handlers which set up other handlers. One handler finds the start tag, then sets up two other handlers. (I won't reproduce that example here -- it's on CPAN. )

Here's the Marpa::HTML code for printing the title. It avoids the awkward state-driven switching of handlers.


say html(
    \$string_containging_html,
    {   'title' => sub { return Marpa::HTML::contents() },
        ':TOP'  => sub { return ( join q{}, @{ Marpa::HTML::values() } ) }
    }
);

The code is fairly transparent. html is a static method. Here it takes two arguments: a string with the html to be parsed, and a hash specifying two handlers. One handler returns the contents of title elements. Another, top-level, handler takes all the values found below it, joins them together, and returns them. (Full documentation is on CPAN.)

Finding titles is an unusually easy problem -- too easy to demonstrate the real advantage of using Marpa::HTML. The HTML::Parser solution for finding titles is as simple as it is, only because it takes advantage of the special and unusually simple properties of HTML title elements. There is only one title in a well-formed HTML document. Most HTML elements can occur more than once. Titles cannot be nested. In standard use, HTML elements are often nested to some depth.

Suppose instead you want to print out, instead of the title, all tables. In that case you need to handle multiple elements, and to deal with nested elements. Here's the Marpa::HTML solution:


say html(
    \$html,
    {   'table' => sub { return Marpa::HTML::original() },
        ':TOP'  => sub { return ( join qq{\n}, @{ Marpa::HTML::values() } ) }
    }
);

Essentially the same code as for finding titles. The Marpa::HTML solution for titles "scales" to tables. For nested tables, the above code returns the outermost table. It will return as many outermost tables as your HTML contains.

To use HTML::Parser to accomplish the same task, an application would have to keep track of the high-level HTML structure, perhaps using state logic and stacks. HTML::Parser has more of a track record, and a reader might reasonably think some additional solution complexity is tolerable for that reason. But I believe that any reader who tries the exercise of writing up an HTML::Parser solution to this problem will become willing to give Marpa::HTML a try.

The Marpa::HTML module contains more examples, and also contains documentation with full explanations of how all of this works. From that documentation, here's an example that's an especially happy match of the solution's shape to the problem's structure. This code determines the maximum element depth of an HTML document:


sub depth_below_me {
    return List::Util::max( 0, @{ Marpa::HTML::values() } );
}
my %handlers_to_calculate_maximum_element_depth = (
    q{*}   => sub { return 1 + depth_below_me() },
    ':TOP' => sub { return depth_below_me() },
);
my $maximum_depth = html( \$string_containing_html_document,
    \%handlers_to_calculate_maximum_element_depth );

HTML::Parser does its job (low-level parsing) extremely well. I've benefited not just from using it, but from studying it. But even with a time-tested module like HTML::Parser in CPAN, I think you'll find that Marpa::HTML brings additional value.

2 Comments

I need to compare with XML::DT. While XML::DT is written for XML, it supports HTML as well, and has a very similar approach.

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.