A Marpa use case: JavaScript Regexp Implementation

Title

Introduction

Back working on the next version of MarpaX::Languages::ECMAScript::AST, one important missing thing was implementing in Perl the equivalent of JavaScript Regular Expressions Engine. Despite JavaScript borrowed the perl's syntax, its semantics behave differently, and one should never copy a regular expression from Perl into JavaScript blindly.

For instance, a vicious difference is that a capture value in JavaScript is the last match. While in Perl this is the last successful match. Consequence can be hard to track down, for example, this regexp:

 /(z)((a+)?(b+)?(c))*/

applied on this input:

 "zaacbbbcac"

will capture, in JavaScript:

 <z> <ac> <a> <undefined> <c>

while perl will say:

 <z> <ac> <a> <bbb> <c>

Finally, a nice exercice and the occasion to show what can be achieved with Marpa::R2 Scanless Interface.

Implementation

ECMAScript specification has a nice writeup of its regular expression grammar and semantics, and it appears that the grammar itself is simple (c.f. the _DATA__ section):

A G1 (i.e. rule) section
 Pattern ::=
      Disjunction                             action => ...

 Disjunction ::=
      Alternative                             action => ...
    | Alternative '|' Disjunction             action => ...

 etc...
A L0 (i.e. lexer) section
 IdentityEscape ~
       [\p{IsIdentityEscape}]

 etc...

 :lexeme ~ <LPAREN_ATOM_DISJUNCTION>
    pause => after
    event => 'LPAREN_ATOM_DISJUNCTION$'
LPAREN_ATOM_DISJUNCTION ~ '('

The interesting thing for the end-user is here in the L0 section:

a reminder of the Perl's very powerful User-Defined Character Properties

This also show that everything that Perl understand as a character class is understanble by Marpa::R2's Scanless Interface.

the notion of "pause" after

Every time Marpa::R2 will successfully identify token, you can instruct it to pause, and you are back into perl userspace. This simple technique allows, for instance, to track all the disjunctions that will go into the equivalent of perl's $1, $2, etc.

On the other hand, the semantics are a bit tedious. The values of most rules are closures that are calling other closures, and so on. Nothing particular since of course we expect Marpa::R2 to consider every value of every rule as an opaque thing, but a big thanks to perl to allow to have multilevel closures...!

The final value is always in the form [lastIndex, [@matches]], since, finally, the important things are case-insentivity and multiline properties. The notion of "global", as showed later in the ECMAScript specification, is a decoration on top of it, basically doing retries at higher indexes in the input string you want to match against.

After all, this exercice is giving a perl implementation of the JavaScript Regular Expression Engine.

Leave a comment

About Jean-Damien Durand

user-pic About::Me::And::Perl