November 2011 Archives

How to Parse HTML

a Marpa-based "Ruby Slippers" approach to parsing liberal and defective HTML. As an example, let's look at a few lines taken more or less at random from the middle of the perl.org landing page. That page is exactly 400 lines long. Here is line 200 and some lines lines to either side of it.


</td>
<td>
<div class="module">
<a href="http://www.perlfoundation.org/">
<img alt=""
    src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
    w…

Marpa and the Ruby Slippers

In I listed the four ideas that are essential to Marpa. This post delves into one of them: Ruby Slippers parsing. In Ruby Slippers parsing, the parser imagines ("wishes") that the language it is parsing is easier to parse than it actually is. The part of the application that handles input (the "lexer") manipulates the input to make the parser's "wishes" come true.

As an example, take liberal HTML. "Liberal HTML" is HTML…

Which Marpa distribution to use?

Marpa::XS or Marpa::PP or the "bare name" Marpa? Use Marpa::XS if you can, Marpa::PP otherwise. The "bare name" Marpa is a legacy distribution, and should be avoided by new users and in new implementations.

Marpa::XS

Marpa::XS incorporates all of my C language speedups. As well as th…

What is the Marpa algorithm?

"the Marpa algorithm" many times. What is that? The implementation involves many details, but the Marpa algorithm itself is basically four ideas. Of these only the most recent is mine. The other three come from papers spanning over 40 years.

Idea 1: Parse by determining which rules can be applied where

The first idea is to track the progress of the a parse by determining, for each token, which rules can be applied and where. Sounds pretty obvious. Not-so-obvious is how to do this efficiently.

In…

Marpa v. Perl regexes: some numbers

The example I will use is unanchored searching for balanced parentheses. I have claimed that many problems now tackled with regexes are better solved with a more powerful parser, like Marpa. I believe the numbers in this post back up that claim.

To be very clear, I am NOT claiming that Marpa should or can replace regexes in general. For each character, all an RE (regular expression) engine needs to do is to compute a transition from one "state" to another state based on that character -- essentially a simple …

About Jeffrey Kegler

user-pic I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.