Marpa Hint: matching an integer
I will not reinvent the wheel: here is the exact definition from the C grammar in its Lex format:
{HP}{H}+{IS}? { return I_CONSTANT; }
{NZ}{D}*{IS}? { return I_CONSTANT; }
"0"{O}*{IS}? { return I_CONSTANT; }
{CP}?"'"([^'\\\n]|{ES})+"'" { return I_CONSTANT; }

How does it translate into a Marpa grammar? The first question always has to be: am I talking about a terminal or a non-terminal? Here the answer is obviously a terminal, so for Marpa this will be a lexeme. With Marpa, a lexeme is the "bridge" between the high-level rules (the usual "grammar") and the low-level rules (how the terminals are defined).
Once a lexeme appears in the high-level rules (G1 in Marpa terminology), it cannot appear in the RHS (Right-Hand Side) of a low-level rule (G0 in Marpa). What distinguishes G0 rules from G1 rules is the separator between the LHS (Left-Hand Side) and the RHS: ::= for G1, ~ for G0.
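To make the two separators concrete, here is a minimal sketch of a toy grammar with one G1 rule and one G0 rule. The names greeting, WORD and parse_greeting are made up for this illustration and are not part of the I_CONSTANT grammar; the Marpa calls are the same ones used by the full script at the end of this post.

```perl
use strict;
use warnings;
use Marpa::R2;

# Toy grammar; 'greeting' and 'WORD' are made-up names for illustration.
my $toy_source = <<'END_OF_GRAMMAR';
:default ::= action => ::first
:start ::= greeting
# G1 (high-level) rule: LHS and RHS are separated by '::='
greeting ::= WORD
# G0 (low-level) rule: LHS and RHS are separated by '~'
WORD ~ [a-z]+
END_OF_GRAMMAR

my $toy_grammar = Marpa::R2::Scanless::G->new({ source => \$toy_source });

# Returns the parse value, or undef when the input is rejected.
sub parse_greeting {
    my ($input) = @_;
    my $recce = Marpa::R2::Scanless::R->new({ grammar => $toy_grammar });
    my $value_ref = eval { $recce->read(\$input); $recce->value };
    return defined $value_ref ? ${$value_ref} : undef;
}

for my $input (qw(hello HELLO)) {
    my $value = parse_greeting($input);
    print "$input\t=>\t", (defined $value ? $value : '<undef>'), "\n";
}
```

WORD is the lexeme here: it is the LHS of a G0 rule and appears on the RHS of a G1 rule, which is exactly the "bridge" role described above.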
Back to I_CONSTANT. It belongs to G0, so one would like to write it as: I_CONSTANT ~ something.
We now come to sequences: Marpa has '*' and '+', with their usual meanings, but no '?'. Furthermore, you cannot mix sequences with non-sequences on an RHS. There are good reasons for these choices; subscribe to or search the marpa-parser group for more.
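Here is a small sketch of what these two restrictions mean in practice, on a toy lexeme that Lex would write as [0-9]+[uU]?. All symbol names (NUMBER, DIGITS, DIGIT, SUFFIX, SUFFIX_maybe) and both helper subs are made up for this illustration. The "optional" symbol is written as two rules, one of them with an empty RHS, which is the same trick the full grammar at the end of this post uses for its _maybe symbols. I am assuming here that Marpa throws when it is given the Lex-style mixed rule.

```perl
use strict;
use warnings;
use Marpa::R2;

# Toy lexeme; all symbol names below are made up for illustration.
my $ok_source = <<'END_OF_GRAMMAR';
:start ::= NUMBER
NUMBER ~ DIGITS SUFFIX_maybe
# A sequence rule stands alone: exactly one symbol, then '+' or '*'.
DIGITS ~ DIGIT+
DIGIT ~ [0-9]
SUFFIX ~ [uU]
# No '?' operator: one rule with the symbol, one rule with an empty RHS.
SUFFIX_maybe ~ SUFFIX
SUFFIX_maybe ~
END_OF_GRAMMAR

# The same intent written Lex-style, with a sequence mixed into a longer RHS.
my $mixed_source = <<'END_OF_GRAMMAR';
:start ::= NUMBER
NUMBER ~ DIGIT+ SUFFIX
DIGIT ~ [0-9]
SUFFIX ~ [uU]
END_OF_GRAMMAR

# True when Marpa accepts the grammar source, false when it throws.
sub grammar_is_accepted {
    my ($source) = @_;
    return eval { Marpa::R2::Scanless::G->new({ source => \$source }); 1 } ? 1 : 0;
}

# True when the whole input is read as one NUMBER.
sub is_number {
    my ($input) = @_;
    my $g = Marpa::R2::Scanless::G->new({ source => \$ok_source });
    my $r = Marpa::R2::Scanless::R->new({ grammar => $g });
    return eval { $r->read(\$input); defined $r->value } ? 1 : 0;
}

print "split grammar accepted: ", grammar_is_accepted($ok_source),    "\n";
print "mixed grammar accepted: ", grammar_is_accepted($mixed_source), "\n";
print "$_\t=>\t", (is_number($_) ? 'NUMBER' : 'rejected'), "\n" for qw(12 12u 12x);
```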
So let's translate the first and the last lines; the others are a mix of them. First I simply put the Lex definitions on the RHS:
I_CONSTANT ~ {HP}{H}+{IS}?
I_CONSTANT ~ {CP}?"'"([^'\\\n]|{ES})+"'"

Remember that G0 is a grammar, so this Lex notation naturally becomes:
I_CONSTANT ~ HP H+ IS?
I_CONSTANT ~ CP? "'" ([^'\\\n]|{ES})+ "'"
Remember: no mixing of sequences with non-sequences in the RHS, i.e.:
I_CONSTANT ~ HP H_many IS_maybe
I_CONSTANT ~ CP_maybe "'" ([^'\\\n]|{ES})_many "'"

Naturally, the ([^'\\\n]|{ES})_many part is not correct. Indeed, we come to one very important subtlety with Marpa: the () notation does NOT mean a group. It means that everything inside is hidden from the parse tree value. It is still part of the grammar; it simply has no impact on the value produced by the grammar.
So let's write a separate rule for that:
I_CONSTANT_INSIDE ~ [^'\\\n] | ES
I_CONSTANT_INSIDE_many ~ I_CONSTANT_INSIDE+
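The point about () hiding symbols instead of grouping them is easier to see on a G1 rule, where values are produced. The sketch below is a toy grammar, not part of the I_CONSTANT grammar: item, WORD and visible_children are made-up names, and I use the ::array action so that the value of a rule is simply the list of its visible children.

```perl
use strict;
use warnings;
use Marpa::R2;

# Toy grammar; 'item' and 'WORD' are made-up names for illustration.
my $bracket_source = <<'END_OF_GRAMMAR';
:default ::= action => ::array
:start ::= item
# Parenthesized symbols are matched but hidden from the rule's value.
item ::= ('<') WORD ('>')
# Without parentheses, the delimiters are part of the value.
item ::= '[' WORD ']'
WORD ~ [a-z]+
END_OF_GRAMMAR

my $bracket_grammar = Marpa::R2::Scanless::G->new({ source => \$bracket_source });

# Returns the child values that 'item' sees, or an empty list on failure.
sub visible_children {
    my ($input) = @_;
    my $recce = Marpa::R2::Scanless::R->new({ grammar => $bracket_grammar });
    my $value_ref = eval { $recce->read(\$input); $recce->value };
    return defined $value_ref ? @{ ${$value_ref} } : ();
}

for my $input ('<abc>', '[abc]') {
    my @children = visible_children($input);
    print "$input\t=>\t", scalar(@children), " visible value(s): @children\n";
}
```

Both inputs are accepted by the grammar; only the number of values handed to the rule differs. That is why parentheses cannot be used to build the ([^'\\\n]|{ES})+ group, and a separate rule is needed instead.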
One remark about character classes: they follow Perl notation. So [^'\\\n] is valid, as is [\p{someClass}], etc. OK, our examples become:
I_CONSTANT ~ HP H_many IS_maybe
I_CONSTANT ~ CP_maybe "'" I_CONSTANT_INSIDE_many "'"
One last thing is still not correct from Marpa's point of view: constant strings are always written within single quotes, and there is no notion of backslash escape in Marpa constant strings. So a quote itself can only be written as a character class:
QUOTE ~ [']
which gives:
I_CONSTANT ~ HP H_many IS_maybe
I_CONSTANT ~ CP_maybe QUOTE I_CONSTANT_INSIDE_many QUOTE

Here we go! This is a valid Marpa grammar. Just repeat the exercise for HP, H, IS, etc. I give the final result below. On the command line, you would get e.g.:
% perl /tmp/I_CONSTANT.pl 123 456 7.89 101112
123 => 123
456 => 456
Error in SLIF parse: Parse exhausted, but lexemes remain, at line 1, column 2
* String before error: 7
* The error was at line 1, column 2, and at character 0x002e '.', ...
* here: .89
Marpa::R2 exception at /tmp/I_CONSTANT.pl line 11.
7.89 => <undef>
101112 => 101112
Did you notice the verbose, precise and meaningful error message? This is the other great power of Marpa; I'll talk about that in another post. Trust me, if you want to know, debug or inspect what happens, Marpa is your best friend.
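As a cross-check that does not depend on Marpa, the four Lex rules can also be transcribed into a plain Perl regex. This is only a sketch to help decide which inputs the lexeme should accept, not a replacement for the grammar: the building blocks are copied from the G0 rules of the script in this post, the name is_i_constant is made up, and the match is anchored on the whole string, whereas Lex would match the longest prefix.

```perl
use strict;
use warnings;

# Building blocks, transcribed from the G0 rules of the grammar in this post.
my $O  = qr/[0-7]/;
my $D  = qr/[0-9]/;
my $NZ = qr/[1-9]/;
my $H  = qr/[a-fA-F0-9]/;
my $HP = qr/0[xX]/;
my $CP = qr/[uUL]/;
my $LL = qr/ll|LL|[lL]/;
my $IS = qr/[uU](?:$LL)?|(?:$LL)[uU]?/;
my $ES = qr/\\(?:['"?\\abfnrtv]|(?:$O){1,3}|x(?:$H)+)/;

# One alternative per Lex rule, in the same order.
my $I_CONSTANT = qr/
      (?:$HP)(?:$H)+(?:$IS)?               # hexadecimal
    | (?:$NZ)(?:$D)*(?:$IS)?               # decimal
    | 0(?:$O)*(?:$IS)?                     # octal, including a lone 0
    | (?:$CP)?'(?:[^'\\\n]|(?:$ES))+'      # character constant
/x;

# True when the whole string is an integer constant.
sub is_i_constant {
    my ($text) = @_;
    return $text =~ /\A(?:$I_CONSTANT)\z/ ? 1 : 0;
}

for my $candidate (qw(123 456 7.89 101112 0x1Fu 'a')) {
    printf "%s\t=>\t%s\n", $candidate,
        is_i_constant($candidate) ? 'I_CONSTANT' : 'no match';
}
```

Unlike the Marpa run above, a failed regex match only says "no match"; it does not tell you where or why the input stopped being an integer constant.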
The source:
#!/usr/bin/env perl
use strict;
use diagnostics;
use Marpa::R2;

my $grammar_source = do {local $/; <DATA>};  # slurp the grammar after __DATA__
my $g = Marpa::R2::Scanless::G->new({source => \$grammar_source});
foreach my $this (@ARGV) {
  my $r = Marpa::R2::Scanless::R->new({grammar => $g});  # one recognizer per input
  eval {
    $r->read(\$this);
    my $value = ${$r->value};
    print "$this\t=>\t$value\n";
  };
  if ($@) {
    print "$@\n$this\t=>\t<undef>\n";  # report the Marpa error, then an undef value
  }
}
__DATA__
:start ::= I_CONSTANT
I_CONSTANT ~ HP H_many IS_maybe
| NZ D_any IS_maybe
| '0' O_any IS_maybe
| CP_maybe QUOTE I_CONSTANT_INSIDE_many QUOTE
BS ~ '\'
CP ~ [uUL]
CP_maybe ~ CP
CP_maybe ~
D ~ [0-9]
D_any ~ D*
ES_AFTERBS ~ [\'\"\?\\abfnrtv]
| O
| O O
| O O O
| 'x' H_many
ES ~ BS ES_AFTERBS
HP ~ '0' [xX]
H ~ [a-fA-F0-9]
H_many ~ H+
I_CONSTANT_INSIDE ~ [^'\\\n] | ES
I_CONSTANT_INSIDE_many ~ I_CONSTANT_INSIDE+
IS ~ U LL_maybe | LL U_maybe
IS_maybe ~ IS
IS_maybe ~
LL ~ 'll' | 'LL' | [lL]
LL_maybe ~ LL
LL_maybe ~
NZ ~ [1-9]
O ~ [0-7]
O_any ~ O*
QUOTE ~ [']
U ~ [uU]
U_maybe ~ U
U_maybe ~