Perl Regular Expression Awesomeness

By Ingy döt Net on November 22, 2015 2:17 AM

This week at work I overheard some coworkers talking about a programming problem. The type that you might get in an interview. The idea was that if you had a string of words smushed together without spaces, how would you go about parsing the string into words again?

I thought about it for a bit and pretty quickly decided to load all of /usr/share/dict/words into some kind of regexp. The main difficultly is that you can't just be greedy or be nongreedy because either could fail. Imagine the inputs:

yougotmail          => you got mail
yougotmailed        => you got mailed
yougotmailman       => you got mailman (or: you got mail man)
yougotmailmanners   => you got mail manners

As you can see, regardless of greedy or nongreedy, you need backtracking. Hmm. Regular expressions have backtracking. Problem solved!

$list = join '|', map {chomp, $_} `cat /usr/share/dict/words`;
$input =~ /^($list)*$/;

That works! Only one little problem. How do I get the captured words? I thought I knew but I couldn't get it to work, so I asked Google. Google was not my friend, so I asked on #p5p. Fortunately the p5p regexp greats were around. Unfortunately they told me I couldn't really do that. At some point mauke++ suggested I could try putting code into a modern Perl regexp.

Long story short I came up with this Perl regexp gem: https://gist.github.com/ingydotnet/94528c938ca94f684270

You can try it out like this:

$ echo yougotmailmanners | DEBUG=1 word-parse.pl
you
got
mailman
mail
manners
you got mail manners

My favorite part of this is the local @stack = (@stack, $^N);. After each match we "push" the matched word (in $^N) onto a stack array; but we also localize the stack. This causes it to get reset to what we want when backtracking happens. That means there is no need for code to determine when a pop is needed.

I doubt this could be done much more elegantly in other languages. I'm sure that code invocation is supported in many newer language's regexp engines, but the local call-stack semantics don't exist because they are deemed inferior. I've written more Bash code than any other language in the last couple years. Bash has the same local semantics. It actually works out pretty nice most of the time.

I suspect only Perl has such a modern regexp engine and the "inferior" local semantics! :)

8 comments

8 Comments

Peter Roberts | November 22, 2015 10:30 AM | Reply

Pretty awesome, thanks for sharing :)

I'm wondering how this will this resolve phrase ambiguity? I'm assuming from the code that it will favour the result with earliest words as long as possible?

Francisco Obispo | November 23, 2015 7:51 AM | Reply

I had a similar problem not so long ago, and ended up using Regexp::Grammars

This is how it works:

#!/usr/bin/env perl
use common::sense;
use Regexp::Grammars;
use Data::Dumper;
use File::Slurp;
use Benchmark qw{timethese};

my $text = $ARGV[0];

my %dict = (
map { $_ => 1 }
grep { length( $_ ) > 1 }
map { chomp( $_ ); $_ } read_file( '/usr/share/dict/words' )
);

$dict{makeup} = 1;
$dict{porn} = 1;
$dict{experts} =1;

say "done reading file";

my $wordre = qr{
+

ysth | November 23, 2015 10:52 AM | Reply

Less efficient, but simple and works:

my @words = $input =~ /\G($list)(?=(?:$list)*\z)/g;

Ingy döt Net replied to comment from ysth | November 23, 2015 2:34 PM | Reply

Genius!

ysth, would you mind writing a couple paragraphs explaining why list capturing works in your regexp?

My best guess is that `\G` with `\g` forms some kind of `pos` loop where `$1` is returned over and over. The assertion makes certain that we parse correctly to the end in each iteration, thus triggering the needed backtracking, thus getting the proper next `$1`.

FWIW, The `\G` stuff is another regexp thing you don't see used outside of Perl. In fact (last I checked) it doesn't work with alternate re::engine implementations inside Perl. It seems like something that is half coded into the engine and half into the Perl guts.

Ingy döt Net replied to comment from Peter Roberts | November 23, 2015 2:44 PM | Reply

Peter,

I was playing around with it last night. This happened:

    echo minus the message for every tried word | perl word-parse.pl
    minus them ess age fore very tried word

I fixed it by putting the words `the` and `for` at the front of the regexp. I suppose a weighting of common words first combined with the long word weighting might yield more optimal results, but this is just an interview question, right? :)

Francisco Obispo | November 23, 2015 5:00 PM | Reply

I realized that the comment section is very small and won't fit the script. I've uploaded it to:

https://github.com/fobispo-link/tools/blob/master/words/grammar

The script is really fast, it will detect words in the dictionary.

ysth replied to comment from Ingy döt Net | November 23, 2015 5:54 PM | Reply

It's really features of perl's match operator, not the regex engine. Though I am surprised alternate engines don't handle \G; that seems like a glaring bug.

What perl's match operator does has not just the normal scalar vs list context distinction, but also (orthogonally) /g vs non-/g and capturing parentheses vs no capturing parentheses. It's worthwhile learning how all 8 resulting flavors work.

See the couple paragraphs before and the paragraph after http://perldoc.perl.org/perlop.html#\G-_assertion_

I see I could have left out the capturing parentheses; I can never keep straight /g vs non-/g list context behavior when there are no capturing parens.

Nova Patch replied to comment from Ingy döt Net | November 23, 2015 8:46 PM | Reply

\G is documented in Mastering Regular Expressions by Jeffrey Friedl, published in 2006, as being supported by Perl, .NET, and Java, as well as PHP and Ruby, with the latter two having slightly different semantics (which make them less useful than the former three). Perl’s implementation is especially powerful in that the last matching position is associated with the string and not the regex, so it can be used with multiple different regexes on the same string. I’m sure \G support among modern regex engines has changed in the last decade and would be interested in investigating for comparison. \G is notoriously under-documented in most engines.

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Ingy döt Net

I am an Acmeist Hacker. I program in many languages to meet many people. Perl people are my favorite people. Currently I am working as a Distinguished Technologist for Hewlett Packard Enterprise; developing the future of cloud solutions.

More info »

Ingy döt Net