FASTA splitter

#!/usr/bin/env perl6

sub MAIN (
Str :$fasta! where *.IO.f,
Int :$number=100,
Str :$out-dir=$*PROGRAM.IO.dirname
) {
mkdir $out-dir unless $out-dir.IO.d;

my $ext = $fasta.IO.extension;
my $basename = $fasta.IO.basename;
$basename ~~ s/\.$ext$//;

sub writer (@seqs) {
state $fnum = 0;
my $fname = $*SPEC.catfile(
$out-dir, sprintf('%s.%04d.%s', $basename, ++$fnum, $ext)
);
printf "Writing %s to %s\n", @seqs.elems, $fname.IO.basename;
my $out = open $fname, :w;
$out.print(@seqs.join("\n"));
};

my $buffer = '';
my @seq_buff = ();
for $fasta.IO.lines -> $line {
if $line.starts-with('>') {
push @seq_buff, $buffer if $buffer;
$buffer = '';
}

$buffer ~= $line;

if @seq_buff.elems == $number {
writer(@seq_buff);
@seq_buff = ();
}
}

writer(@seq_buff) if @seq_buff;

put "Done, see directory '$out-dir'";
}

I know, another script having to do with FASTA? Well, I'm writing posts if only to put out more example Perl 6 code so people can see what's possible. I also happen to need these things, so I'm writing them and exploring what the language makes easy for me and what I don't understand.

In this example, I think the "state" variable inside the "writer" sub is quite interesting. I'm creating a closure with that sub so that I can call it with just the @seqs that need to be written. Since this is a buffer-based program, I need to check if there is still something in the buffer after I leave the main processing loop, hence the "writer."

I hit an interesting bug in my early version because I failed to open the output file with ":w". Instead of telling me that, I got an error about "too many positionals." Kudos to the ever-helpful #perl6 chatroom to help me understand the error of my ways.

I wish the "basename" routine would allow me to pass in an optional file extension that I would like removed the way that the bash function does. I keep finding interesting ways to do this. Here's another:

(my $basename = $file.IO.basename) ~~ s/\.\w*?$//;

But I would rather:

my $basename = $file.IO.basename($file.IO.extension);

Or summat.

Anyway, Perl 6 continues to be a thoroughly enjoyable language to work in!

6 Comments

If BioPerl6 becomes a thing, I won't mind at all :)

Thanks for sharing your knowledge. I've learned about $*SPEC from your post :)

I'm not FASTA expert, but my approach would be to use ">" as input separator instead of "\n". And then simply push 100 lines (=sequences) into each file through rotor:

# beware first empty element
writer($_) for $fasta.IO.lines( nl-in => '>').[1..*].rotor(100, :partial);

sub writer (@seqs) {
...
# restore sequence start character
$out.print('>', $_); for $seqs;
};

Of course TIMTOWTDI, however - from my experience - delimiter approach is waaaaaaay faster than buffer juggling.

My mistake, should be:

$out.print('>', $_); for @seqs;

Leave a comment

About Ken Youens-Clark

user-pic I work for Dr. Bonnie Hurwitz at the University of Arizona where I use Perl quite a bit in bioinformatics and metagenomics. I am also trying to write a book at https://www.gitbook.com/book/kyclark/metagenomics/details. Comments welcome.