a Marpa-based "Ruby Slippers"
approach to parsing liberal
and defective HTML.
As an example, let's look at a
taken more or less at random
from the middle
of the perl.org
That page is exactly 400 lines long.
Here is line 200 and some
lines to either side of it.
I listed the four ideas
that are essential to
This post delves into
one of them: Ruby Slippers parsing.
In Ruby Slippers parsing, the parser imagines
that the language it is parsing is easier
to parse than it actually is.
The part of the application that handles input
manipulates the input
to make the parser's
"wishes" come true.
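
To make the division of labor concrete, here is a toy sketch in Python.
It is not Marpa code: `ToyParser` and `ruby_slippers_parse` are
hypothetical names invented for this illustration, and the "grammar" is
just perfectly nested tags. The parser stays strict and only reports
which end tag it is wishing for; the input-handling loop invents that
tag whenever the real input is rejected. The policy of silently
dropping a stray end tag is a crude assumption made to keep the sketch
short.

```python
# Toy illustration only: ToyParser and ruby_slippers_parse are hypothetical
# names, not part of Marpa or any real library.

class ToyParser:
    """Strict parser for perfectly nested tags such as p ... /p."""

    def __init__(self):
        self.open_tags = []          # stack of currently open elements

    def expected_end(self):
        """The end tag the parser is currently wishing for, if any."""
        return "/" + self.open_tags[-1] if self.open_tags else None

    def read(self, token):
        """Accept a token and return True, or reject it and return False."""
        if token.startswith("/"):
            if token != self.expected_end():
                return False         # strict grammar: wrong or stray end tag
            self.open_tags.pop()
        else:
            self.open_tags.append(token)
        return True


def ruby_slippers_parse(tokens):
    """Feed tokens to the strict parser, inventing the end tags it asks for."""
    parser, accepted = ToyParser(), []
    for token in tokens:
        while not parser.read(token):
            wish = parser.expected_end()
            if wish is None:
                break                # nothing is open: drop the stray end tag
            parser.read(wish)        # make the parser's wish come true
            accepted.append(wish + " (virtual)")
        else:
            accepted.append(token)   # the real token was finally accepted
    # At end of input, supply end tags for anything still open.
    while parser.expected_end() is not None:
        wish = parser.expected_end()
        parser.read(wish)
        accepted.append(wish + " (virtual)")
    return accepted
```

Given `["table", "tr", "td", "/table"]`, the loop returns
`["table", "tr", "td", "/td (virtual)", "/tr (virtual)", "/table"]`:
the parser never sees defective input, because the missing end tags
are supplied on demand.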
As an example,
take liberal HTML.
"Liberal HTML" is HTML…
or the "bare name" Marpa?
if you can,
The "bare name" Marpa is a legacy distribution,
and should be avoided by new users
and in new implementations.
incorporates all of my C language speedups.
As well as th…
"the Marpa algorithm"
What is that?
involves many details,
but the Marpa algorithm itself is basically four ideas.
Of these only the most recent is mine.
The other three come
spanning over 40 years.
Idea 1: Parse by determining which rules can be applied where
The first idea is to track the progress of a parse by determining,
for each token, which rules can be applied and where.
Sounds pretty obvious.
Not-so-obvious is how
to do this efficiently.
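
As a rough illustration of "which rules can be applied where", here is
a minimal recognizer written in the style of Earley's algorithm. It is
a sketch, not Marpa's implementation: the names `GRAMMAR` and
`recognize`, the item layout, and the balanced-parentheses grammar are
all assumptions chosen for this example. Each item records a rule, how
far into that rule the parse has progressed, and the token position at
which the rule started. No attempt is made at efficiency.

```python
# Minimal Earley-style recognizer. The grammar and all names are illustrative
# assumptions; this is not Marpa's implementation.

GRAMMAR = {
    # S ::= "(" S ")" S | <empty>      (balanced parentheses)
    "S": [("(", "S", ")", "S"), ()],
}


def recognize(tokens, start="S", grammar=GRAMMAR):
    # An item (lhs, rhs, dot, origin) means: rule lhs -> rhs has been matched
    # up to position `dot`, and that match began at token position `origin`.
    sets = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        sets[0].add((start, rhs, 0, 0))

    for pos in range(len(tokens) + 1):
        worklist = list(sets[pos])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            new_items = []
            if dot < len(rhs) and rhs[dot] in grammar:
                # Prediction: the next symbol is a nonterminal, so its rules
                # may begin at this position.
                sym = rhs[dot]
                for alt in grammar[sym]:
                    new_items.append((sym, alt, 0, pos))
                # If sym has already been completed as an empty match here,
                # step over it right away.
                for l2, r2, d2, o2 in list(sets[pos]):
                    if l2 == sym and d2 == len(r2) and o2 == pos:
                        new_items.append((lhs, rhs, dot + 1, origin))
            elif dot < len(rhs):
                # Scanning: the next symbol is a terminal; compare it with
                # the current token.
                if pos < len(tokens) and tokens[pos] == rhs[dot]:
                    sets[pos + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completion: the rule is fully matched, so advance every
                # item that was waiting for `lhs` at the origin position.
                for l2, r2, d2, o2 in list(sets[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new_items.append((l2, r2, d2 + 1, o2))
            for item in new_items:
                if item not in sets[pos]:
                    sets[pos].add(item)
                    worklist.append(item)

    # The input is accepted if a start rule spans the whole input.
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in sets[len(tokens)])
```

For example, `recognize("(()())")` is `True` and `recognize("(()")` is
`False`. The set of items at each position is exactly the record of
which rules can be applied there.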
The example I will use is unanchored searching for balanced parentheses.
I have claimed that many problems now tackled with regexes are better
solved with a more powerful parser, like Marpa.
I believe the numbers in this post back up that claim.
To be very clear,
I am NOT claiming that Marpa should or can replace
regexes in general.
For each character,
all an RE
(regular expression) engine needs to do
is to compute a transition from
one "state" to another state based on that character --
essentially a simple …