FASTQ to FASTQ with Perl 6
#!/usr/bin/env perl6

sub MAIN (:$out-dir="", *@fastq) {
    if ($out-dir.chars > 0 && ! $out-dir.IO.d) {
        mkdir $out-dir;
    }

    my $i = 0;
    for @fastq -> $fastq {
        (my $basename = $fastq.IO.basename) ~~ s/\.\w*?$//;
        my $out-file = $*SPEC.catfile(
            $out-dir || $fastq.IO.dirname, $basename ~ '.fa');
        printf "%3d: %s -> %s\n",
            ++$i, $fastq.IO.basename, $out-file;
        my $out-fh = open $out-file, :w;

        for $fastq.IO.lines -> $header, $seq, $break, $qual {
            # skip first "@"
            $out-fh.print('>' ~ $header.substr(1) ~ "\n");
            # "lines" strips the newline, so add it back
            $out-fh.print($seq ~ "\n");
        }
        $out-fh.close;
    }

    put "Done.";
}
The FASTQ format is one of the worst conceived in the history of bioinformatics, and that's saying something. The only sane FASTQ format uses 4 lines per sequence: a header starting with an "@" sign, the sequence, the header repeated but starting with a "+" (or just the "+"), and the quality scores (encoded as either Phred+33 or Phred+64). Here's a sample:
@HWI-ST885:65:C07WUACXX:7:2302:1866:196007 1:N:0:GCCAAT
GTAAATGATGATCTGCCGCCGCAGCTCCTTTTTTTCTTTCAAGGCCAATTCGGTAGGCTTCAGCTTGGCGGAGCTTTCAATCACAGCGGCAT
+
BBBFFAFAIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
What I thought would be fun to show off here is that a "for" loop can read the contents of a list into more than one variable at a time. Here I'd like to read four lines at a time, so I just give the loop four variables and it takes four "lines" on each pass. How simple!
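To see the four-at-a-time behavior without a file on disk, here is a minimal sketch that runs the same kind of loop over an in-memory list. The two records in @lines are made up for illustration, and @fasta is just a hypothetical holder for the converted lines.

```raku
# Stand-in for the lines of a FASTQ file: two made-up records,
# four lines each (header, sequence, "+", quality).
my @lines = '@read1', 'ACGT', '+', 'IIII',
            '@read2', 'TTGA', '+', 'FFFF';

my @fasta;

# A block with four parameters takes four items from the list per pass.
for @lines -> $header, $seq, $break, $qual {
    # Swap the leading "@" for ">" and keep only header and sequence.
    @fasta.push('>' ~ $header.substr(1), $seq);
}

# Print one FASTA line per element.
.put for @fasta;
```

One thing to keep in mind: the loop expects the list length to be a multiple of four. A truncated file that ends partway through a record should make the last pass fail with a "Too few positionals" error rather than silently dropping the record.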
Cool!
FWIW, you don't have to name variables that you don't use (but that you do want to "consume") in the for loop. So:
for $fastq.IO.lines -> $header, $seq, $, $ { ... }
is also perfectly valid.
Wow, thanks, Liz! That is really cool.
Also, the title should have been "FASTQ to FASTA." Darn.