FASTA splitter
#!/usr/bin/env perl6sub MAIN (
Str :$fasta! where *.IO.f,
Int :$number=100,
Str :$out-dir=$*PROGRAM.IO.dirname
) {
mkdir $out-dir unless $out-dir.IO.d;my $ext = $fasta.IO.extension;
my $basename = $fasta.IO.basename;
$basename ~~ s/\.$ext$//;sub writer (@seqs) {
state $fnum = 0;
my $fname = $*SPEC.catfile(
$out-dir, sprintf('%s.%04d.%s', $basename, ++$fnum, $ext)
);
printf "Writing %s to %s\n", @seqs.elems, $fname.IO.basename;
my $out = open $fname, :w;
$out.print(@seqs.join("\n"));
};my $buffer = '';
my @seq_buff = ();
for $fasta.IO.lines -> $line {
if $line.starts-with('>') {
push @seq_buff, $buffer if $buffer;
$buffer = '';
}$buffer ~= $line;
if @seq_buff.elems == $number {
writer(@seq_buff);
@seq_buff = ();
}
}writer(@seq_buff) if @seq_buff;
put "Done, see directory '$out-dir'";
}
I know, another script having to do with FASTA? Well, I'm writing posts if only to put out more example Perl 6 code so people can see what's possible. I also happen to need these things, so I'm writing them and exploring what the language makes easy for me and what I don't understand.
In this example, I think the "state" variable inside the "writer" sub is quite interesting. I'm creating a closure with that sub so that I can call it with just the @seqs that need to be written. Since this is a buffer-based program, I need to check if there is still something in the buffer after I leave the main processing loop, hence the "writer."
I hit an interesting bug in my early version because I failed to open the output file with ":w". Instead of telling me that, I got an error about "too many positionals." Kudos to the ever-helpful #perl6 chatroom to help me understand the error of my ways.
I wish the "basename" routine would allow me to pass in an optional file extension that I would like removed the way that the bash function does. I keep finding interesting ways to do this. Here's another:
(my $basename = $file.IO.basename) ~~ s/\.\w*?$//;
But I would rather:
my $basename = $file.IO.basename($file.IO.extension);
Or summat.
Anyway, Perl 6 continues to be a thoroughly enjoyable language to work in!
If BioPerl6 becomes a thing, I won't mind at all :)
BioPerl6 is a thing and a very good thing it is (https://github.com/cjfields/bioperl6). Sorry, I should have mentioned that I could have used their FASTA parser. Here I wanted to present a self-contained script for purposes of exploring just a couple of ideas.
Thanks for sharing your knowledge. I've learned about $*SPEC from your post :)
I'm not FASTA expert, but my approach would be to use ">" as input separator instead of "\n". And then simply push 100 lines (=sequences) into each file through rotor:
# beware first empty element
writer($_) for $fasta.IO.lines( nl-in => '>').[1..*].rotor(100, :partial);
sub writer (@seqs) {
...
# restore sequence start character
$out.print('>', $_); for $seqs;
};
Of course TIMTOWTDI, however - from my experience - delimiter approach is waaaaaaay faster than buffer juggling.
My mistake, should be:
$out.print('>', $_); for @seqs;
Sorry, there's a bug in there in that I didn't re-incorporate the newlines from the original sequences. How embarrassing. Here's a corrected version:
Nice use of "rotor" and "nl-in" to shorten up the code, Pawel! Here's a new version that I like much better using those ideas: