What! No Lexer?

To those who have noted that Marpa::XS does not come with a lexer, I'd respond that, in a very real sense, it does: Perl. Perl 5 is a powerful lexical analyzer.

If you're trying to figure out how to write your first Marpa parser, I'd recommend a close look at Wolfgang Kinkeldei's recent posting about his Marpa-powered CSS parser. Wolfgang lays his parser out in a very elegant fashion, and I find his code makes an excellent template.

Especially nice-looking is Wolfgang's lexer. Wolfgang follows one of the two main strategies for lexical analysis in Perl: he consumes the input using substitution (s/ ... / ... /) commands.
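To make the substitution strategy concrete, here is a minimal sketch. It is not Wolfgang's code: the token names, the patterns, and the function name lex_by_substitution are all invented for illustration. Each s/// strips a recognized prefix off a private copy of the input, so the remaining string always starts at the next unread character.

```perl
use strict;
use warnings;

# Hypothetical token table for illustration: [ token name, pattern ].
my @TOKEN_TYPES = (
    [ NUMBER => qr/\d+/ ],
    [ IDENT  => qr/[A-Za-z_]\w*/ ],
    [ OP     => qr{[-+*/=]} ],
);

# Tokenize by repeatedly stripping a matched prefix off the input.
sub lex_by_substitution {
    my ($input) = @_;    # a private copy; the substitutions consume it
    my @tokens;
    TOKEN: while (1) {
        $input =~ s/\A\s+//;    # discard leading whitespace
        last TOKEN if $input eq '';
        for my $type (@TOKEN_TYPES) {
            my ( $name, $re ) = @{$type};
            if ( $input =~ s/\A($re)// ) {
                push @tokens, [ $name, $1 ];
                next TOKEN;
            }
        }
        die "Cannot lex input at: $input\n";
    }
    return \@tokens;
}
```

Called as lex_by_substitution('x = 42 + y'), this returns five [ name, value ] pairs: IDENT x, OP =, NUMBER 42, OP +, IDENT y. In a Marpa parser, each pair would then be handed to the recognizer as a token.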

The other strategy is to use the Perl regex search position to track the progress of the lexical analysis. In the search-position strategy, your cases consist of a lot of match commands using the \G anchor and the gc modifier: m/\G ... /gc. An excellent tutorial on this kind of lexing, albeit in a non-Marpa context, can be found in Mark Jason Dominus's book, Higher Order Perl. Mark's coverage of lexing is in Chapter 8, "Parsing", on pages 359-375. Mark's book can be read on-line. I highly recommend Mark's book and own a paper copy.
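As a rough illustration of the search-position strategy, here is a small sketch. It is not taken from Higher Order Perl; the token names and the function name lex_by_position are made up for the example. Every case is a match anchored with \G, and the c modifier keeps a failed match from resetting the position, so the next case tries again at the same spot.

```perl
use strict;
use warnings;

# Tokenize by advancing the regex search position; the input is never modified.
sub lex_by_position {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;    # start scanning at the beginning
    while ( pos($input) < length $input ) {
        if    ( $input =~ m/\G\s+/gc )            { next }    # skip whitespace
        elsif ( $input =~ m/\G(\d+)/gc )          { push @tokens, [ NUMBER => $1 ] }
        elsif ( $input =~ m/\G([A-Za-z_]\w*)/gc ) { push @tokens, [ IDENT  => $1 ] }
        elsif ( $input =~ m{\G([-+*/=])}gc )      { push @tokens, [ OP     => $1 ] }
        else {
            die 'Cannot lex input at offset ' . pos($input) . "\n";
        }
    }
    return \@tokens;
}
```

For the input 'x = 42 + y' this yields IDENT x, OP =, NUMBER 42, OP +, IDENT y. Because the string stays intact, pos() always gives the offset into the original text.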

Actually, regular expressions are well within Marpa's capabilities, and lexical analysis could be done in Marpa. But a look at Mark's and Wolfgang's code should convince you that lexical analysis is easy to do in Perl.

4 Comments

Thanks for the flowers. The code snippet you are referring to is only a very quick and dirty first try I made using Marpa. Nice to hear that you like the lexer. However, with this approach it is hard to tell the line number of the source code where an error occurred. But we would not use Perl if no solution existed... By expanding the regex patterns like

$text =~ m{\G \s* ([~|]?=) \s*}xmsgc and do { ... };

the source text is not destroyed, and the position where the last matching regex stopped can be queried using

pos($text)

This makes error reporting possible.
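One way to turn that offset into a line number is sketched below. The helper name line_and_column and the sample text are assumptions for illustration, not part of the scanner mentioned here.

```perl
use strict;
use warnings;

# Convert a character offset (as returned by pos) into a 1-based line and column.
sub line_and_column {
    my ( $text, $offset ) = @_;
    my $before       = substr $text, 0, $offset;
    my $line         = 1 + ( $before =~ tr/\n// );    # count newlines before the offset
    my $last_newline = rindex $before, "\n";          # -1 when still on the first line
    my $column       = $offset - $last_newline;
    return ( $line, $column );
}

my $text = "a {\n  color: red;\n}\n";
pos($text) = 13;    # pretend the scanner stopped just before "red"
my ( $line, $column ) = line_and_column( $text, pos $text );
print "line $line, column $column\n";    # prints "line 2, column 10"
```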

I like Marpa very much and have experimented even further. The scanner I am currently using is part of an experiment in reading SCSS syntax, which is a superset of CSS. It can be found here:

https://github.com/wki/CSS-SCSS/blob/master/lib/CSS/SCSS/Parser/Scanner.pm

Thanks for having created Marpa.

Well, I don't know whether there is a language like BNF / ABNF for describing tokens, or whether BNF can be used for that, but automatic generation of a lexer from rules is what I meant when asking about a lexer.

@Jeffrey: Thanks for the response.

What would be nice to have in the Marpa documentation is a full example of generating a lexer and parser, for example from a description in some Internet RFC (email address, perhaps?).

BTW, I wonder how hard it would be to write an ABNF-to-Marpa parser using Marpa...

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.