Cleaning up the IDs in a FASTA file

By Ken Youens-Clark on September 20, 2016 8:00 PM

I have some FASTA files with headers like this:

>gi|83274083|ref|AC_000032.1| Mus musculus strain mixed chromosome 10, alternate assembly Mm_Celera, whole genome shotgun sequence

I wanted to keep just the second pipe-delimited field (83274083 in the header above) as the ID, so here's a Perl 6 script that rewrites each file in place:

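#!/usr/bin/env perl6

use File::Temp;

sub MAIN (*@files) {
    my $i = 0;

    for @files -> $file {
        # Write the cleaned copy to a temp file, then replace the original.
        my ($tmpfile, $tmpfh) = tempfile();
        printf "%3d: %s\n", ++$i, $file.IO.basename;

        for $file.IO.lines -> $line {
            if $line.substr(0,1) eq '>' {
                # Header line: keep only the second pipe-delimited field.
                my @flds = $line.split('|');
                $tmpfh.print(">" ~ @flds[1] ~ "\n");
            }
            else {
                # Sequence line: copy it through unchanged.
                $tmpfh.print("$line\n");
            }
        }

        $tmpfh.close;
        move $tmpfile, $file;
    }

    # Inside MAIN so it prints after the files are processed;
    # mainline code outside MAIN runs before MAIN is called.
    put "Done.";
}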
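
One caveat with the script above: a header with no "|" in it has no second field, so @flds[1] is undefined and the new ID comes out empty. As a minimal sketch, the header logic could be pulled into a small sub that leaves such lines alone. The name clean-header is mine for illustration and is not part of the script above.

```perl6
#!/usr/bin/env perl6

# Hypothetical helper, not part of the original script.
sub clean-header (Str $line) {
    # Sequence lines pass through unchanged.
    return $line unless $line.starts-with('>');

    my @flds = $line.split('|');

    # No second field (no '|' in the header): leave the line as it is.
    return $line if @flds.elems < 2;

    return '>' ~ @flds[1];
}

put clean-header('>gi|83274083|ref|AC_000032.1| Mus musculus strain mixed chromosome 10');
# >83274083
put clean-header('>seq1 already clean');
# >seq1 already clean
put clean-header('ACGTACGT');
# ACGTACGT
```

In the script, the body of the inner loop would then shrink to a single $tmpfh.put(clean-header($line)) call.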

6 Comments

Charlie Gonzalez | September 21, 2016 3:52 AM

Hello kyclark,

Thank you for this post. Today I learned what the FASTA file format is.

Best,

Charlie

Liz | September 21, 2016 8:55 AM

FWIW, I would write the inner for loop as:

$tmpfh.say( .starts-with(">") ?? ">" ~ .split('|')[1] !! $_ )
for $file.IO.lines;

:-)

Pawel bbkr Pabian | September 22, 2016 4:22 PM

Does split produce a lazy list in Perl 6?
split('|')[1]
Will it read all the fields into memory, or stop after finding the second?

BTW: How did you start your career in bioinformatics? Was your primary education biology/genetics and you used Perl as a tool to solve your tasks, or was it the other way - you were a bored programmer that thought one day "it would be cool to sequence and save my hamster to hard drive"?
What is the bio knowledge threshold required to start work at a bioinformatics company?

Liz | September 23, 2016 10:43 AM

split() currently does not create a lazy list. If the grammar engine is used to split, it creates all Match objects prior to returning and starting to hand them back. If it is not, it is using the internal nqp::split function, which also builds structures in memory (and does not have a parameter to only split into x elements).

So, split is not lazy.

Ken Youens-Clark replied to comment from Liz | September 26, 2016 11:49 PM

"starts-with" is a great thing to show my beginner students! I keep forgetting about that, but it's a nice borrow from Python. Thanks!

Ken Youens-Clark replied to comment from Pawel bbkr Pabian | September 26, 2016 11:52 PM

Pawel, I accidentally stumbled into bioinformatics. I was a Perl hacker who got hired by Lincoln Stein back in 2001 to work at Cold Spring Harbor Lab. It was immediately fascinating and horribly intimidating. I have never felt even a little bit competent as a biologist, but I keep trying to learn as much as possible. After working for CSHL (on the Gramene.org project), I moved to the Univ. of Arizona where I work for Dr. Bonnie Hurwitz in a field called "metagenomics." I'm still in way over my head, but she puts up with me because I can hack together pipelines. So, just keep learning, get in at entry level, and never stop.

About Ken Youens-Clark

I work for Dr. Bonnie Hurwitz at the University of Arizona where I use Perl quite a bit in bioinformatics and metagenomics. I am also trying to write a book at https://www.gitbook.com/book/kyclark/metagenomics/details. Comments welcome.
