a Marpa-based "Ruby Slippers"
approach to parsing liberal
and defective HTML.
As an example, let's look at a
few lines
taken more or less at random
from the middle
of the perl.org
landing page.
That page is exactly 400 lines long.
Here is line 200 and some lines
lines to either side of it.
</td>
<td>
<div class="module">
<a href="http://www.perlfoundation.org/">
<img alt=""
src="http://mc-cdn.pimg.net/images/icons/onion.vee5cb98.png"
w…
In
I listed the four ideas
that are essential to
Marpa.
This post delves into
one of them: Ruby Slippers parsing.
In Ruby Slippers parsing, the parser imagines
("wishes")
that the language it is parsing
is easier
to parse than it actually is.
The part of the application that handles input
(the "lexer")
manipulates the input
to make the parser's
"wishes" come true.
As an example,
take liberal HTML.
"Liberal HTML" is HTML…
Marpa::XS
or
Marpa::PP
or the "bare name" Marpa?
Use
Marpa::XS
if you can,
Marpa::PP
otherwise.
The "bare name" Marpa is a legacy distribution,
and should be avoided by new users
and in new implementations.
Marpa::XS
Marpa::XS
incorporates all of my C language speedups.
As well as th…
"the Marpa algorithm"
many times.
What is that?
The
implementation
involves many details,
but the Marpa algorithm itself is basically four ideas.
Of these only the most recent is mine.
The other three come
from papers
spanning over 40 years.
Idea 1: Parse by determining which rules can be applied where
The first idea is to track the progress of the a parse by determining,
for each token, which rules can be applied and where.
Sounds pretty obvious.
Not-so-obvious is how
to do this efficiently.
In…
The example I will use is unanchored searching for balanced parentheses.
I have claimed that many problems now tackled with regexes are better
solved with a more powerful parser, like Marpa.
I believe the numbers in this post back up that claim.
To be very clear,
I am NOT claiming that Marpa should or can replace
regexes in general.
For each character,
all an RE
(regular expression) engine needs to do
is to compute a transition from
one "state" to another state based on that character --
essentially a simple …