Parse::RecDescent and number of elements read on the fly

I recently had to develop a small parser for some coworkers and I turned to Parse::RecDescent for handling it. The grammar was not particularly difficult to address, but it had some fields that behave like arrays whose number of elements is declared dynamically, so it's not generally possible to use the repetition facilities provided by Parse::RecDescent, because they require the number of repetitions to be known beforehand.

(Yes, I learnt to use BODY and EXTENDED in Movable Type...)

I somehow figured out a solution, that I'll explain shortly, but then I thought that it might already be inside Parse::RecDescent::FAQ so I tried to see something there. Actually there's something about, pointing to this thread in Perl Monks, but I don't like the solutions very much (possibly because Parse::RecDescent evolved in the mean time, I don't know).

The trick I'm using deals with modifying the variable $text inside actions. Modifications to this variable are kept if the match is successful and ignored otherwise, so it allows taking parts of the input that have to be dealt with rules that cannot be described in some other ways.

We will imagine that our input data is shaped like this:

  • plain strings are represented as Sn(...), where n indicates the number of characters (octets in our case, just to make things simple) and ... represents the n characters;
  • arrays of stuff are represented similarly but with an initial A character

An example string might be the following:

A2(A2(S3(Hey)S13(Hello, World!))S5(Ciao!))

which should be represented like this in Perl:

    [
      [
        'Hey',
        'Hello, World!'
      ],
      'Ciao!'
    ];

Dealing with the strings is straightforward, as the following grammar shows:

length: /\d+/

string: 'S' length '('
   {
      $return = substr $text, 0, $item{length}, '';
      1;
   }
   ')'

In the action, we get the required substring directly from $text, deleting it on the fly (note that we're using the four-parameters variant of substr, so the text that is selected is also substituted with the fourth parameter, an empty string in our case). Here's an example driver program with the grammar above:

#!/opt/perl/bin/perl
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

my $grammar = <<'END_OF_GRAMMAR';
length: /\d+/

string: 'S' length '('
   {
      $return = substr $text, 0, $item{length}, '';
      1;
   }
   ')'
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar);
my $res = $parser->string('S13(Hello, World!)');
print {*STDOUT} Dumper($res);

where the result is quite obvious:

$VAR1 = 'Hello, World!';

Things get a bit more difficult when dealing with arrays of stuff, either strings or other arrays. In this case we are not dealing with characters directly in $text and additionally there can be different elements mixed together. In this case, our first move is to define a generic rule to include every possible thing that can get as an array element, i.e. an any rule:

any: string | array

The array will then be defined in terms of a sequence of any elements; we will have to iterate explicitly inside an action:

array: 'A' length '('
   {
      my @sequence;
      for my $i (1 .. $item{length}) {
         my $item = get_next_item_and_chip_off_from($text);
         push @sequence, $item;
      }
      $return = \@sequence;
   }
   ')'

The funny function with the long name says what we have to do next, which is two things actually: parse the current $text to look for an any value and then chip the corresponding string off of $text, so that we can go on with the next item or with the end of the array.

It's easy to address the first problem, because we have a reference to the parser available in the actions:

my $item = $thisparser->any($text);

You are guessing it right: this call allows you to perform a sub-parse of another string while parsing the whole thing. You will have this kind of recursion for every nested level in the input data.

Alas, this is not going to tell us how much input has been consumed from this sub-parsing, so we still have the problem of figuring this out. A consideration helps here: every action in a rule knows what the remaining $text is, so it is sufficient to get it as well as the element we want. This is easily solved with a simple wrapper that will give us both:

any: string | array
any_with_remaning: any
   {
      $return = {
         value => $item{any},
         remaining => $text,
      };
   }
array: 'A' length '('
   {
      my @sequence;
      for my $i (1 .. $item{length}) {
         my $record = $thisparser->any_with_remaining($text);
         push @sequence, $record->{value};
         $text = $record->{remaining};
      }
      $return = \@sequence;
   }
   ')'

Nice and straightforward: our wrapper rule returns a value to put in the sequence, and the text that remains to be parsed. It is sufficient to set $text to the latter in order to go on with parsing of the rest (either another element of the array or the closing parenthesis).

This approach involves a bit of copying strings around, so it might be a bit too heavy with big inputs. To work this around we can just record how many characters are extracted when calling any, and pass this number back to array in order to perform a substr and avoid copying strings:

any: string | array
any_with_offset: <rulevar: $length> # $thisoffset not reliable
any_with_offset: { $length = length($text) } any
   {
      $return =  {
         value  => $item{any},
         offset => $length - length($text),
      };
   }
array: 'A' length '('
   {
      my @sequence;
      for my $i (1 .. $item{length}) {
         my $record = $thisparser->any_with_offset($text);
         push @sequence, $record->{value};
         substr $text, 0, $record->{offset}, '';
      }
      $return = \@sequence;
   }
   ')'

The augmented any_with_offset now returns the value and the length of the input string that was actually consumed by any. Note that the offset is calculated using the $length variable. It is initialised before any parsing is attempted with this rule, so that we are sure to get the length of the whole input; after matching the rule any, the local $text will contain only the remaining text, so that we can calculate how many characters were consumed by difference.

The array rule is then modified to use this offset and perform some cutting on $text inside the rule itself, ready for the next iteration or for closing the array.

Some clever folks might observe that the value calculated for $length should be the same as the variable $thisoffset. I thought this in the first time, but time proved me wrong (and I don't know why...), so I had to implement the explicit calculation above.

The following program shows the final grammar and some driver code to test it:

use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;

my $grammar = <<'END_OF_GRAMMAR';
length: /\d+/

string: 'S' length '('
   {
      $return = substr $text, 0, $item{length}, '';
      1;
   }
   ')'

any: string | array
any_with_offset: <rulevar: $length> # $thisoffset not reliable
any_with_offset: { $length = length($text) } any
   {
      $return =  {
         value  => $item{any},
         offset => $length - length($text),
      };
   }

array: 'A' length '('
   {
      my @sequence;
      for my $i (1 .. $item{length}) {
         my $record = $thisparser->any_with_offset($text);
         push @sequence, $record->{value};
         substr $text, 0, $record->{offset}, '';
      }
      $return = \@sequence;
   }
   ')'
END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar);
my $res = $parser->any('A2(A2(S3(Hey)S13(Hello, World!))S5(Ciao!))');
print {*STDOUT} Dumper($res);

The output is the following, as expected:

$VAR1 = [
          [
            'Hey',
            'Hello, World!'
          ],
          'Ciao!'
        ];

In conclusion, getting a number of elements that is dynamically read from the input is definitely doable, even though it requires a bit of extra effort.

4 Comments

I wonder if using Marpa as parser instead of Parse::RecDescent would make it easier...

On this particular problem, Marpa and Parse::RecDescent are basically on equal terms, and it comes down to outside factors (XS versus pure Perl, core module vs. new module, etc., etc.) I put a few quick thoughts here.

Leave a comment

About Flavio Poletti

user-pic