Overlapping regex matches
Someone on irc.perl.org #perl-help posed a good question tonight: why does this only find some of the matches?
my $sequence = "ggg atg aaa tgt tcc cgg taa atg aat gcc cgg gaa ata tag cct gac ctg a";
$sequence =~ tr/ //d;
print "Input sequence is: $sequence \n";
while ($sequence =~ /(atg(...)*?(taa|tag|tga))/g) {
    print "$1 \n";
}
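Because, by default, a /g match resumes each subsequent search at the end of the previous match, so overlapping hits are never found. As this blog post explains, a positive lookahead assertion is the key to finding all of them. The lookahead is zero-width, so each match consumes no characters and the next search starts just one character further along, while the capture group inside the lookahead still records the full hit. This works great: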
while ($sequence =~ /(?=(atg(?:...)*?(taa|tag|tga)))/g) {
    print "$1\n";
}
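To see the difference end to end, here is a minimal, self-contained sketch of the lookahead approach. The helper name find_overlapping_orfs is my own invention for illustration, not something from the IRC question, and I'm assuming we want to step through whole codons with (?:...)*? (as the original pattern did) so the stop codon stays in frame with the atg.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, hit] pairs for every in-frame
# atg...stop hit, including hits that overlap one another.
sub find_overlapping_orfs {
    my ($sequence) = @_;
    my @hits;
    # The whole pattern sits inside a zero-width lookahead, so each match
    # consumes no characters and /g retries from the next position.
    while ($sequence =~ /(?=(atg(?:...)*?(?:taa|tag|tga)))/g) {
        # $-[0] is the offset of the zero-width match, which is also
        # where the captured hit in $1 begins.
        push @hits, [ $-[0], $1 ];
    }
    return @hits;
}

my $sequence = "ggg atg aaa tgt tcc cgg taa atg aat gcc cgg gaa ata tag cct gac ctg a";
$sequence =~ tr/ //d;    # strip the spaces between codons

# Print each hit with its 0-based start offset.
printf "%2d  %s\n", @$_ for find_overlapping_orfs($sequence);
```

On the sequence above this reports four hits, starting at offsets 3, 8, 21 and 25, where the plain /g loop from the question only reports the ones at 3 and 21.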
I'm partial to bioinformatics homework after 4 years of hacking on the stuff. :)