Cleaning up the IDs in a FASTA file

I have some FASTA files with headers like this:

>gi|83274083|ref|AC_000032.1| Mus musculus strain mixed chromosome 10, alternate assembly Mm_Celera, whole genome shotgun sequence

I wanted to extract just the 2nd field, so here's a Perl 6 script to do that:

#!/usr/bin/env perl6

use File::Temp;

sub MAIN (*@files) {
my $i = 0;
for @files -> $file {
my ($tmpfile, $tmpfh) = tempfile();
printf "%3d: %s\n", ++$i, $file.IO.basename;
for $file.IO.lines -> $line {
if $line.substr(0,1) eq '>' {
my @flds = $line.split('|');
$tmpfh.print(">" ~ @flds[1] ~ "\n");
}
else {
$tmpfh.print("$line\n");
}
}

$tmpfh.close;
move $tmpfile, $file;
}
}

put "Done.";

6 Comments

Hello kyclark,

Thanks you for this post, Today I learned what Fasta File formats are.

Best,

Charlie

FWIW, I would write the inner for loop as:

$tmpfh.say( .starts-with(">") ?? ">" ~ .split('|')[1] !! $_ )
for $file.IO.lines;

:-)

Does split produce lazy list in Perl 6?
split('|')[1]
Will it read all fields in memory or just stop after finding second?


BTW: How did you start your career in bioinformatics? Was your primary education biology/genetics and you used Perl as a tool to solve your tasks, or was it the other way - you were a bored programmer that thought one day "it would be cool to sequence and save my hamster to hard drive"?
What is the bio knowledge threshold required to start work in bioinformatics company?

split() currently does not create a lazy list. If the grammar engine is used to split, it creates all Match objects prior to returning and starting to hand them back. If it is not, it is using the internal nqp::split function, which also builds structures in memory (and does not have a parameter to only split into x elements).

So, split is not lazy.

Leave a comment

About kyclark

user-pic I blog about Perl.