Overlapping regex matches

Someone on irc.perl.org #perl-help posed a good question tonight: why does this only find some of the matches?

my $sequence = "ggg atg aaa tgt tcc cgg taa atg aat gcc cgg gaa ata tag cct gac ctg a"; 
$sequence =~ tr/ //d; 
print "Input sequence is: $sequence \n";  
while ($sequence =~ /(atg(...)*?(taa|tag|tga))/g) {print "$1 \n";}

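The reason is that, by default, /g starts each subsequent search at the end of the previous match, so overlapping hits are never found. As this blog post explains, a positive lookahead assertion is the key to finding all of them. The lookahead itself matches zero characters, so the search position never jumps past a hit and every starting position gets tried, while the capture group inside the lookahead still records the full text. This works great: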

while ($sequence =~ /(?=(atg.*?(taa|tag|tga)))/g) {
   print "$1\n";
}
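
One caveat: the `.*?` in the snippet above is not restricted to whole codons, so it can end on a stop codon that is out of frame with the `atg`. If the reading frame matters, as it did in the original question's `(...)*?`, something like the following may be closer to what was wanted. This is only a sketch: the helper name `overlapping_orfs` is made up for illustration and is not from the IRC discussion.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): collect every
# stretch that starts at "atg" and runs in whole codons up to the first
# in-frame stop codon, including stretches that overlap one another.
sub overlapping_orfs {
    my ($sequence) = @_;
    my @orfs;
    # The lookahead consumes nothing, so /g tries every start position.
    # (?:...)*? advances three bases at a time, preserving the frame.
    while ($sequence =~ /(?=(atg(?:...)*?(?:taa|tag|tga)))/g) {
        push @orfs, $1;
    }
    return @orfs;
}

my $sequence = "ggg atg aaa tgt tcc cgg taa atg aat gcc cgg gaa ata tag cct gac ctg a";
$sequence =~ tr/ //d;
print "$_\n" for overlapping_orfs($sequence);
```

On the sample sequence this reports four hits instead of the original two, and each one is a whole number of codons long.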

I'm partial to bioinformatics homework after 4 years of hacking on the stuff. :)

About Jay @ Mutation Grid

Perl / web / database development since 1995. Contact us for your next project.