FASTA splitter

By Ken Youens-Clark on September 29, 2016 12:29 AM

#!/usr/bin/env perl6

sub MAIN (

    Str :$fasta! where *.IO.f,

    Int :$number=100,

    Str :$out-dir=$*PROGRAM.IO.dirname

) {

    mkdir $out-dir unless $out-dir.IO.d;

    my $ext      = $fasta.IO.extension;

    my $basename = $fasta.IO.basename;

    $basename ~~ s/\.$ext$//;

    sub writer (@seqs) {

        state $fnum = 0;

        my $fname = $*SPEC.catfile(

            $out-dir, sprintf('%s.%04d.%s', $basename, ++$fnum, $ext)

        );

        printf "Writing %s to %s\n", @seqs.elems, $fname.IO.basename;

        my $out = open $fname, :w;

        $out.print(@seqs.join("\n"));

    };

    my $buffer = '';

    my @seq_buff = ();

    for $fasta.IO.lines -> $line {

        if $line.starts-with('>') {

            push @seq_buff, $buffer if $buffer;

            $buffer = '';

        }

        $buffer ~= $line;

        if @seq_buff.elems == $number {

            writer(@seq_buff);

            @seq_buff = ();

        }

    }

    writer(@seq_buff) if @seq_buff;

    put "Done, see directory '$out-dir'";

}

I know, another script having to do with FASTA? Well, I'm writing posts if only to put out more example Perl 6 code so people can see what's possible. I also happen to need these things, so I'm writing them and exploring what the language makes easy for me and what I don't understand.

In this example, I think the "state" variable inside the "writer" sub is quite interesting. I'm creating a closure with that sub so that I can call it with just the @seqs that need to be written. Since this is a buffer-based program, I need to check if there is still something in the buffer after I leave the main processing loop, hence the "writer."

I hit an interesting bug in my early version because I failed to open the output file with ":w". Instead of telling me that, I got an error about "too many positionals." Kudos to the ever-helpful #perl6 chatroom to help me understand the error of my ways.

I wish the "basename" routine would allow me to pass in an optional file extension that I would like removed the way that the bash function does. I keep finding interesting ways to do this. Here's another:

(my $basename = $file.IO.basename) ~~ s/\.\w*?$//;

But I would rather:

my $basename = $file.IO.basename($file.IO.extension);

Or summat.

Anyway, Perl 6 continues to be a thoroughly enjoyable language to work in!

6 comments

6 Comments

hobbs.cleverdomain.org | September 29, 2016 8:19 PM | Reply

If BioPerl6 becomes a thing, I won't mind at all :)

Ken Youens-Clark | September 29, 2016 11:56 PM | Reply

BioPerl6 is a thing and a very good thing it is (https://github.com/cjfields/bioperl6). Sorry, I should have mentioned that I could have used their FASTA parser. Here I wanted to present a self-contained script for purposes of exploring just a couple of ideas.

Pawel bbkr Pabian | October 2, 2016 2:58 PM | Reply

Thanks for sharing your knowledge. I've learned about $*SPEC from your post :)

I'm not FASTA expert, but my approach would be to use ">" as input separator instead of "\n". And then simply push 100 lines (=sequences) into each file through rotor:

# beware first empty element
writer($_) for $fasta.IO.lines( nl-in => '>').[1..*].rotor(100, :partial);

sub writer (@seqs) {
...
# restore sequence start character
$out.print('>', $_); for $seqs;
};

Of course TIMTOWTDI, however - from my experience - delimiter approach is waaaaaaay faster than buffer juggling.

Pawel bbkr Pabian | October 2, 2016 3:04 PM | Reply

My mistake, should be:

$out.print('>', $_); for @seqs;

Ken Youens-Clark | October 3, 2016 7:31 PM | Reply

Sorry, there's a bug in there in that I didn't re-incorporate the newlines from the original sequences. How embarrassing. Here's a corrected version:

#!/usr/bin/env perl6

sub MAIN (

    Str :$fasta! where *.IO.f,

    Int :$number=100,

    Str :$out-dir=$*PROGRAM.IO.d

) {

    mkdir $out-dir unless $out-dir.IO.d;

    my $ext      = $fasta.IO.extension;

    my $basename = $fasta.IO.basename;

    $basename ~~ s/\.$ext$//;

    sub writer (@seqs) {

        state $fnum = 0;

        my $fname = $*SPEC.catfile(

            $out-dir, sprintf('%s.%04d.%s', $basename, ++$fnum, $ext)

        );

        printf "Writing %s to %s\n", @seqs.elems, $fname.IO.basename;

        my $out = open $fname, :w;

        $out.print(@seqs.join("\n"));

    };

    my (@buff, @seq_buff);

    for $fasta.IO.lines -> $line {

        if $line.starts-with('>') {

            push @seq_buff, @buff.join("\n") if @buff;

            @buff = ();

        }

        @buff.push($line);

        if @seq_buff.elems == $number {

            writer(@seq_buff);

            @seq_buff = ();

        }

    }

    writer(@seq_buff) if @seq_buff;

    put "Done, see directory '$out-dir'";

}

Ken Youens-Clark | October 3, 2016 8:41 PM | Reply

Nice use of "rotor" and "nl-in" to shorten up the code, Pawel! Here's a new version that I like much better using those ideas:

#!/usr/bin/env perl6

sub MAIN (

    Str :$fasta! where *.IO.f,

    Int :$number=100,

    Str :$out-dir=$*PROGRAM.IO.d

) {

    mkdir $out-dir unless $out-dir.IO.d;

    my $ext   = $fasta.IO.extension;

    (my $base = $fasta.IO.basename) ~~ s/\.$ext$//;

    my $fh    = open $fasta, :r, nl-in => '>';

    my $fnum  = 0;

    for $fh.lines.[1..*].rotor($number, :partial) -> @seqs {

        my $fname = $*SPEC.catfile(

            $out-dir, sprintf('%s.%04d.%s', $base, ++$fnum, $ext)

        );

        printf "Writing %s to %s\n", @seqs.elems, $fname.IO.basename;

        my $out = open $fname, :w;

        for @seqs -> $seq { $out.print('>', $seq) }

    }

    put "Done, see directory '$out-dir'";

}

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Ken Youens-Clark

I work for Dr. Bonnie Hurwitz at the University of Arizona where I use Perl quite a bit in bioinformatics and metagenomics. I am also trying to write a book at https://www.gitbook.com/book/kyclark/metagenomics/details. Comments welcome.

More info »

kyclark

FASTA splitter

6 Comments

Leave a comment

About Ken Youens-Clark

Search this blog