November 2011 Archives

How to Parse HTML

By Jeffrey Kegler on November 30, 2011 9:12 AM

a Marpa-based "Ruby Slippers" approach to parsing liberal and defective HTML. As an example, let's look at a few lines taken more or less at random from the middle of the perl.org landing page. That page is exactly 400 lines long. Here is line 200 and some lines lines to either side of it.


</td>
<td>
<div class="module">
<a href="http://www.perlfoundation.org/">
<img alt=""
    src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
    w…

6 comments

Marpa and the Ruby Slippers

By Jeffrey Kegler on November 27, 2011 3:48 PM

In I listed the four ideas that are essential to Marpa. This post delves into one of them: Ruby Slippers parsing. In Ruby Slippers parsing, the parser imagines ("wishes") that the language it is parsing is easier to parse than it actually is. The part of the application that handles input (the "lexer") manipulates the input to make the parser's "wishes" come true.

As an example, take liberal HTML. "Liberal HTML" is HTML…

2 comments

Which Marpa distribution to use?

By Jeffrey Kegler on November 21, 2011 8:52 PM

Marpa::XS or Marpa::PP or the "bare name" Marpa? Use Marpa::XS if you can, Marpa::PP otherwise. The "bare name" Marpa is a legacy distribution, and should be avoided by new users and in new implementations.

Marpa::XS

Marpa::XS incorporates all of my C language speedups. As well as th…

8 comments

What is the Marpa algorithm?

By Jeffrey Kegler on November 18, 2011 6:09 PM

"the Marpa algorithm" many times. What is that? The implementation involves many details, but the Marpa algorithm itself is basically four ideas. Of these only the most recent is mine. The other three come from papers spanning over 40 years.

Idea 1: Parse by determining which rules can be applied where

The first idea is to track the progress of the a parse by determining, for each token, which rules can be applied and where. Sounds pretty obvious. Not-so-obvious is how to do this efficiently.

In…

5 comments

Marpa v. Perl regexes: some numbers

By Jeffrey Kegler on November 7, 2011 5:21 PM

The example I will use is unanchored searching for balanced parentheses. I have claimed that many problems now tackled with regexes are better solved with a more powerful parser, like Marpa. I believe the numbers in this post back up that claim.

To be very clear, I am NOT claiming that Marpa should or can replace regexes in general. For each character, all an RE (regular expression) engine needs to do is to compute a transition from one "state" to another state based on that character -- essentially a simple …