Repeated Capturing and Parsing

An interesting query was recently posted to the internal Perl mail list at work. The questioner was trying to match a pattern repeatedly, capturing all of the results in an array. But, it wasn't doing quite what he expected. The message, with minor edits, went a little something like the following.

I'm trying to extract key/value pairs from a file with the following contents:
- name = gcc_xo_src_clk, type = rcg
+ name = cxo_clk, type = xo, fgroup = xo, wt = 10, bloo = blah
? type = hm_mnd_rcg, name = bo : type = rcg_mn
+ name = pxo_clk

I was hoping to do something like this:

@list = $_ =~ m{ ^[-+?] \s* (\S+) \s* = \s* (\S+) \s* (?:, \s* (\S+) \s* = \s* (\S+) \s*)* }xms;

Thinking @list would be assigned the alternating key/value pairs. But the above doesn't extract anything sane. Adding the /gc modifiers doesn't make any difference.

If I do the following, it extracts the first two key/value pairs correctly (if the line has more than one pair).

@list = $_ =~ m{
    ^[-+?] \s* (\S+) \s* = \s* (\S+) \s*
    , \s* (\S+) \s* = \s* (\S+) \s*
}xms;

If I keep repeating the pattern in the second line, it keeps matching more key/value pairs.

I would expect using (?: )* should mean zero or more instances of match inside the parentheses, but obviously it's not working. What am I doing wrong?

When I'm presented with a problem like this, that is some kind of structured data, I immediately think of writing a parser. I'll get back to that in a bit, but I wanted to address the confusion about capturing in the pattern. And, in fact, that's how the discussion on the mail list proceeded.

Repeated Capturing

First, let's simplify the example to demonstrate why our seeker of wisdom wasn't getting back the list of items he expected.

my @matches = 'a b c d e' =~ /^(a) \s* (?: ([bcde]) \s* )*/xms;

say "(@matches)"; # prints "(a e)"

Capturing parentheses in Perl are treated somewhat like registers. Most Perl programmers are familiar with the $n variables, which hold the values of a successful pattern match. For example $1 holds the value matched by the first set of parentheses, $2 holds the value of the second set, and so on.

When a pattern is matched in list context, as above, it's effectively the same as writing,

'a b c d e' =~ /^(a) \s* (?: ([bcde]) \s* )*/xms;

my @matches = ( $1, $2 );

These pattern match variables are scalars and, as such, will only hold a single value. That value is whatever the capturing parentheses matched last. So, in our simplified example, $1 matches a, which is obvious enough. As the pattern repeats, $2 would be set to b, then c, and so on until the final match of e.

That explains why the pattern match wasn't returning the expected list. What can be done about it?

Capturing Along the Way

If we break down the sample data, we see that it generalizes to,

prefix key = value[, ...] [: key = value[, ...]]

The first approach that came to mind is to split the data into multiple lines. Each line can then have its initial prefix removed and saved, then parsed for its key/value pairs. That's starting to look a lot like parsing, which I promised to get to later. For the purposes of this discussion, I wanted to be able to accomplish the task with a single regular expression.

To capture all of the values we want, we need to remove the repeating set of non-capturing parentheses. However, we still need to repeat the match, ideally returning all of the captured values in one statement. We can do that with the /g and /c regular expression modifiers.

my @list = $string =~ m{ ([-+?,:]) \s* (\w+) \s* = \s* (\w+) \s* }xmsgc;

I've done two things here. First, I replaced the \S character classes, used to match the key and value, with \w. The + pattern in a Perl regular expression is greedy, so the former character class was also matching the comma used to separate key/value pairs in the data. This left the literal comma with nothing to match, so was one source of confusion.

Second, I noted that the initial prefix, while syntactically important, could be viewed in the same way as the comma and colon separators. I combined all of these separators and added a capture around them so we can later make sense of the parsed data.

When matched against the data, the pattern results in a list like,

("-", "name", "gcc_xo_src_clk", ",", "type", "rcg", "+", "name", "cxo_clk", ...)

Now we can process the data using a simple state machine.

my $state = undef;

while ( my $token = shift @list ) {
if ( $token eq '-' ) { $state = 'dash'; next; }
# ...
if ( $token eq ',' ) { next; }

my $key = shift @list;
my $value = shift @list;

if ( $state eq 'dash' ) {
# ...
}
}

Even though we did all of the data extraction using a single pattern match, it looks remarkably like ... a parser! The pattern is simply the tokenizer used to feed tokens into our state machine, the parser.

Parsing

I stated at the outset that I looked at this as a parsing problem, so the solution I would use is most likely a parser. For simple, one-off scripts, I'd use a technique similar to the one I described in the previous section. However, for more complex data or a more complex script, I'd turn to a real parser.

In fact, one of my contributions to the thread that led me to compose this post included an example of using the $^R and $^N variables in embedded code blocks to demonstrate a rudimentary parser that allowed a simulated form of capturing within a repeated non-capturing group. I won't go into any detail beyond showing what I wrote. As this was from an early point in the thread, the prefix is ignored in this example.

my @list = ();

my $kv = qr{
(\w+) (?{ $^N; }) # capture the key
\s* = \s* (\w+)
(?{ $^R = [ $^R, $^N ]; }) # capture the value, saving the key
(?{ push @list, @{ $^R } }) # push the key/value onto @list
}xms;

$data =~ m{ (?: ^[-+?] \s* $kv \s* (?:[,:] \s* $kv \s* )* )* }xms;

Fortunately for us, there are parsing modules on the CPAN.

Prior to Perl 5.10, Damian Conway had written Parse::RecDescent, but with the introduction of grammar-like facilities like named captures and named backreferences, Damian improved upon his original work and presented the Perl community with Regexp::Grammars.

What does a parser for this data built with Regexp::Grammars look like?

my $parser = qr{
    <[Line]>+

<token: Prefix> <MATCH= ([-+?]) >
<token: Key> <MATCH= (\w+) >
<token: Value> <MATCH= (\w+) >

<rule: Line> <Prefix> <Pairs> <Options>?
<rule: Pairs> <[Pair]>* % ,
<rule: Pair> <Key> = <Value>
<rule: Options> : <[Option]>* % ,
<rule: Option> <Key> = <Value>
}x;

if ( $data =~ $parser ) {
# Do something with %/
}

This is a trivial example and all the work is left to be done by inspecting the parse tree in %/. However, the module supports embedded code that will be called when a token or rule matches, which can be used to process the data as its parsed.

References

1 Comment

Maybe I'm misunderstanding the problem, but doesn't

my @matches = $data =~ m/(\w+) \s* = \s* (\w+)/xgc;

do what you want?

Leave a comment

About Chris Grau

user-pic I use Perl to support an EDA engineering organization. Occasionally I write about the things I do with Perl.