High-level HTML Parsing

By Jeffrey Kegler on March 28, 2010 9:43 AM

Marpa::HTML is a high-level HTML parser, built on top of the very high-quality HTML::Parser module. Why bother with high-level parsing, especially if it means layering one parser on top of another?

Here is an example, taken from the main documentation for HTML::Parser. The example prints out the title of an HTML document. To do this, HTML::Parser uses handlers which set up other handlers. One handler finds the start tag, then sets up two other handlers. (I won't reproduce that example here -- it's on CPAN.)

Here's the Marpa::HTML code for printing the title. It avoids the awkward state-driven switching of handlers.


say html(
    \$string_containing_html,
    {   'title' => sub { return Marpa::HTML::contents() },
        ':TOP'  => sub { return ( join q{}, @{ Marpa::HTML::values() } ) }
    }
);

The code is fairly transparent. html is a static method. Here it takes two arguments: a reference to a string containing the HTML to be parsed, and a reference to a hash specifying two handlers. One handler returns the contents of title elements. The other, the top-level handler, takes all the values found below it, joins them together, and returns the result. (Full documentation is on CPAN.)

Finding titles is an unusually easy problem -- too easy to demonstrate the real advantage of using Marpa::HTML. The HTML::Parser solution for finding titles is as simple as it is only because it takes advantage of the unusually simple properties of HTML title elements. A well-formed HTML document has only one title, whereas most HTML elements can occur more than once. Titles cannot be nested, whereas in standard use other HTML elements are often nested to some depth.

Suppose that, instead of the title, you want to print out all the tables. In that case you need to handle multiple elements, and to deal with nested elements. Here's the Marpa::HTML solution:


say html(
    \$html,
    {   'table' => sub { return Marpa::HTML::original() },
        ':TOP'  => sub { return ( join qq{\n}, @{ Marpa::HTML::values() } ) }
    }
);

This is essentially the same code as for finding titles. The Marpa::HTML solution for titles "scales" to tables. For nested tables, the above code returns the outermost table. It will return as many outermost tables as your HTML contains.

To use HTML::Parser to accomplish the same task, an application would have to keep track of the high-level HTML structure, perhaps using state logic and stacks. HTML::Parser has more of a track record, and a reader might reasonably think some additional solution complexity is tolerable for that reason. But I believe that any reader who tries the exercise of writing up an HTML::Parser solution to this problem will become willing to give Marpa::HTML a try.
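For readers who want a starting point for that exercise, here is a minimal sketch of the state logic involved. It is my own illustration, not code from the HTML::Parser or Marpa::HTML documentation. The name outermost_tables is hypothetical, and the sketch assumes every table has an explicit, correctly nested end tag, which is exactly the assumption that liberal, real-world HTML tends to break.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the original text of every outermost
# <table> element. Assumes each <table> has a matching </table>.
sub outermost_tables {
    my ($html) = @_;
    my @tables;          # completed outermost tables
    my $depth  = 0;      # number of <table> elements currently open
    my $buffer = q{};    # text of the table being collected

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tagname, $text ) = @_;
                $depth++ if $tagname eq 'table';
                $buffer .= $text if $depth > 0;
            },
            'tagname,text'
        ],
        end_h => [
            sub {
                my ( $tagname, $text ) = @_;
                return if $depth == 0;    # end tag outside any table
                $buffer .= $text;
                if ( $tagname eq 'table' and --$depth == 0 ) {
                    push @tables, $buffer;
                    $buffer = q{};
                }
            },
            'tagname,text'
        ],
        # Text, comments and other events: keep them only inside a table.
        default_h => [
            sub {
                my ($text) = @_;
                $buffer .= $text if $depth > 0 and defined $text;
            },
            'text'
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @tables;
}
```

Calling outermost_tables($html) in list context returns one string per outermost table. Even with the simplifying assumption, the application has to carry a depth counter and a buffer across three handlers, and a table whose end tag is missing is silently dropped.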

The Marpa::HTML module contains more examples, and also contains documentation with full explanations of how all of this works. From that documentation, here's an example that's an especially happy match of the solution's shape to the problem's structure. This code determines the maximum element depth of an HTML document:


use List::Util ();    # loads List::Util::max, called by its full name below

sub depth_below_me {
    return List::Util::max( 0, @{ Marpa::HTML::values() } );
}
my %handlers_to_calculate_maximum_element_depth = (
    q{*}   => sub { return 1 + depth_below_me() },
    ':TOP' => sub { return depth_below_me() },
);
my $maximum_depth = html( \$string_containing_html_document,
    \%handlers_to_calculate_maximum_element_depth );

HTML::Parser does its job (low-level parsing) extremely well. I've benefited not just from using it, but from studying it. But even with a time-tested module like HTML::Parser on CPAN, I think you'll find that Marpa::HTML brings additional value.

2 comments

Alberto Simões | March 28, 2010 10:55 AM

I need to compare with XML::DT. While XML::DT is written for XML, it supports HTML as well, and has a very similar approach.

Jeffrey Kegler replied to comment from Alberto Simões | March 28, 2010 5:52 PM

I'll be interested in your comments. Marpa::HTML is now targeted primarily at HTML, and in particular seeks to be useful where the parser has to be very liberal in the HTML it accepts. It is essentially targeting the opposite end of the parsing difficulty spectrum from XML.

An inconvenience for someone trying to parse XML using Marpa::HTML would be that the current version only implements the default settings of HTML::Parser for its low-level parsing, and those defaults are not right for XML. As an example, HTML::Parser by default treats tags as case-insensitive, which is right for HTML, but not right for XML. HTML::Parser itself allows you to configure this behavior, but Marpa::HTML does not yet support this.

I targeted liberal XHTML/HTML before XML, because I wanted to showcase the ability of the Marpa parser to handle difficult grammars and parsing situations. XML is a very well-behaved language, and submits nicely to previous parsing techniques.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.
