my @words = $input =~ /\G($list)(?=(?:$list)*\z)/g;
ysth, would you mind writing a couple paragraphs explaining why list capturing works in your regexp?
My best guess is that `\G` with `/g` forms some kind of `pos` loop where `$1` is returned over and over. The lookahead assertion makes certain that we can parse correctly to the end in each iteration, thus triggering the needed backtracking, thus getting the proper next `$1`.
FWIW, the `\G` stuff is another regexp thing you don't see used outside of Perl. In fact (last I checked) it doesn't work with alternate re::engine implementations inside Perl. It seems like something that is half coded into the engine and half into the Perl guts.
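One way to check the `pos` loop guess is to write the loop out by hand and compare. This is only a sketch: the word list and input string are made up for illustration, and the explicit `while` loop is an approximation of what list-context `/g` does, not a description of the engine internals.

```perl
use strict;
use warnings;

# Hypothetical word list and input, chosen only for illustration.
my $list  = join '|', qw(this is a test);
my $input = 'thisisatest';

# List context with /g: every successful match contributes its $1.
my @words = $input =~ /\G($list)(?=(?:$list)*\z)/g;

# Roughly the same thing in scalar context: each iteration resumes
# at pos($input), which is the position \G anchors to.
my @looped;
pos($input) = 0;
while ($input =~ /\G($list)(?=(?:$list)*\z)/g) {
    push @looped, $1;
}

print "@words\n";   # this is a test
print "@looped\n";  # this is a test
```

Both forms stop when no word matches at the current position, which here happens at the end of the string.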
I was playing around with it last night. This happened:
echo minus the message for every tried word | perl word-parse.pl
minus them ess age fore very tried word
I fixed it by putting the words `the` and `for` at the front of the regexp. I suppose a weighting of common words first combined with the long word weighting might yield more optimal results, but this is just an interview question, right? :)
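The effect of word order is easy to reproduce on a smaller case. A minimal sketch, assuming a tiny made-up dictionary; `split_words` is a hypothetical helper that wraps the regexp from the top of the thread, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: split $input using the alternatives in the order given.
sub split_words {
    my ($input, @dict) = @_;
    my $list = join '|', map { quotemeta $_ } @dict;
    return $input =~ /\G($list)(?=(?:$list)*\z)/g;
}

my $input = 'themessage';

# "them" is listed before "the", so it is tried first, and the rest
# of the string still parses (as "ess" + "age"), so it is accepted.
my @them_first = split_words($input, qw(them ess age the message));

# With "the" at the front, the more natural split is found first.
my @the_first = split_words($input, qw(the message them ess age));

print "@them_first\n";  # them ess age
print "@the_first\n";   # the message
```

The regexp takes the first alternative that lets the remainder parse to the end, so the order of the alternation decides between equally valid splits.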
What Perl's match operator does depends not just on the normal scalar vs list context distinction, but also (orthogonally) on /g vs non-/g and on capturing parentheses vs no capturing parentheses. It's worthwhile learning how all 8 resulting flavors work.
See the couple paragraphs before and the paragraph after http://perldoc.perl.org/perlop.html#\G-_assertion_
I see I could have left out the capturing parentheses; I can never keep straight /g vs non-/g list context behavior when there are no capturing parens.
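For reference, here is a small sketch of the four list-context flavors, using a made-up string and pattern. The scalar-context flavors (a true/false result without /g, iteration with /g) are not shown.

```perl
use strict;
use warnings;

my $s = 'a1b2';

# /g with capturing parens: all captures from all matches.
my @g_caps   = $s =~ /([a-z])(\d)/g;   # a 1 b 2

# /g without capturing parens: each whole match.
my @g_nocaps = $s =~ /[a-z]\d/g;       # a1 b2

# No /g, with capturing parens: captures from the first match only.
my @caps     = $s =~ /([a-z])(\d)/;    # a 1

# No /g, no capturing parens: the list (1) on success.
my @nocaps   = $s =~ /[a-z]\d/;        # 1

print "@g_caps | @g_nocaps | @caps | @nocaps\n";
```

So with /g in list context, dropping the parentheses around a pattern that is captured as a whole returns the same strings.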
`\G` is documented in Mastering Regular Expressions by Jeffrey Friedl, published in 2006, as being supported by Perl, .NET, and Java, as well as PHP and Ruby, with the latter two having slightly different semantics (which make them less useful than the former three). Perl’s implementation is especially powerful in that the last matching position is associated with the string and not the regex, so it can be used with multiple different regexes on the same string. I’m sure `\G` support among modern regex engines has changed in the last decade and would be interested in investigating for comparison. `\G` is notoriously under-documented in most engines.
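Because the position belongs to the string, several different regexes can take turns on it. A minimal sketch of that idea as a toy tokenizer; the input and token names are made up, and it relies on the /c modifier so that a failed match leaves `pos` where it was.

```perl
use strict;
use warnings;

# Hypothetical input for a toy tokenizer.
my $src = 'x=42;';
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    # Each regex is anchored with \G at the string's current position.
    # With /gc, a failed attempt does not reset pos($src), so the next
    # regex is tried at the same place.
    if    ($src =~ /\G([a-z]+)/gc) { push @tokens, "IDENT($1)" }
    elsif ($src =~ /\G(\d+)/gc)    { push @tokens, "NUM($1)" }
    elsif ($src =~ /\G([=;])/gc)   { push @tokens, "PUNCT($1)" }
    else  { die "unexpected character at offset " . pos($src) }
}

print "@tokens\n";  # IDENT(x) PUNCT(=) NUM(42) PUNCT(;)
```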