Cleaning up the IDs in a FASTA file
I have some FASTA files with headers like this:
>gi|83274083|ref|AC_000032.1| Mus musculus strain mixed chromosome 10, alternate assembly Mm_Celera, whole genome shotgun sequence
I wanted to extract just the 2nd field, so here's a Perl 6 script to do that:
#!/usr/bin/env perl6use File::Temp;
sub MAIN (*@files) {
my $i = 0;
for @files -> $file {
my ($tmpfile, $tmpfh) = tempfile();
printf "%3d: %s\n", ++$i, $file.IO.basename;
for $file.IO.lines -> $line {
if $line.substr(0,1) eq '>' {
my @flds = $line.split('|');
$tmpfh.print(">" ~ @flds[1] ~ "\n");
}
else {
$tmpfh.print("$line\n");
}
}$tmpfh.close;
move $tmpfile, $file;
}
}put "Done.";
Hello kyclark,
Thanks you for this post, Today I learned what Fasta File formats are.
Best,
Charlie
FWIW, I would write the inner for loop as:
$tmpfh.say( .starts-with(">") ?? ">" ~ .split('|')[1] !! $_ )
for $file.IO.lines;
:-)
Does split produce lazy list in Perl 6?
split('|')[1]
Will it read all fields in memory or just stop after finding second?
BTW: How did you start your career in bioinformatics? Was your primary education biology/genetics and you used Perl as a tool to solve your tasks, or was it the other way - you were a bored programmer that thought one day "it would be cool to sequence and save my hamster to hard drive"?
What is the bio knowledge threshold required to start work in bioinformatics company?
split() currently does not create a lazy list. If the grammar engine is used to split, it creates all Match objects prior to returning and starting to hand them back. If it is not, it is using the internal nqp::split function, which also builds structures in memory (and does not have a parameter to only split into x elements).
So, split is not lazy.
"starts-with" is a great thing to show my beginner students! I keep forgetting about that, but it's a nice borrow from Python. Thanks!
Pawel, I accidentally stumbled into bioinformatics. I was a Perl hacker who got hired by Lincoln Stein back in 2001 to work at Cold Spring Harbor Lab. It was immediately fascinating and horribly intimidating. I have never felt even a little bit competent as a biologist, but I keep trying to learn as much as possible. After working for CSHL (on the Gramene.org project), I moved to the Univ. of Arizona where I work for Dr. Bonnie Hurwitz in a field called "metagenomics." I'm still in way over my head, but she puts up with me because I can hack together pipelines. So, just keep learning, get in at entry level, and never stop.