FASTQ to FASTA with Perl 6

#!/usr/bin/env perl6

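sub MAIN (:$out-dir="", *@fastq) {
    # Create the output directory if one was requested and it doesn't exist yet
    if ($out-dir.chars > 0 && ! $out-dir.IO.d) {
        mkdir $out-dir;
    }

    my $i = 0;
    for @fastq -> $fastq {
        # Strip the final extension, e.g. "reads.fastq" -> "reads"
        (my $basename = $fastq.IO.basename) ~~ s/\.\w*?$//;
        my $out-file = $*SPEC.catfile(
            $out-dir || $fastq.IO.dirname, $basename ~ '.fa');
        printf "%3d: %s -> %s\n",
            ++$i, $fastq.IO.basename, $out-file;
        my $out-fh = open $out-file, :w;

        # Take four lines at a time: header, sequence, "+" line, quality
        for $fastq.IO.lines -> $header, $seq, $break, $qual {
            # skip first "@"
            $out-fh.print('>' ~ $header.substr(1) ~ "\n");
            # "lines" strips newlines, so add one back after the sequence
            $out-fh.print($seq ~ "\n");
        }
        $out-fh.close;
    }

    put "Done.";
}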

The FASTQ format is one of the worst conceived in the history of bioinformatics, and that's saying something. The only sane FASTQ layout uses four lines per sequence: a header starting with an "@" sign, the sequence, a separator line starting with "+" (either a bare "+" or the header repeated after it), and the quality scores (encoded as either Phred+33 or Phred+64). Here's a sample:

@HWI-ST885:65:C07WUACXX:7:2302:1866:196007 1:N:0:GCCAAT
GTAAATGATGATCTGCCGCCGCAGCTCCTTTTTTTCTTTCAAGGCCAATTCGGTAGGCTTCAGCTTGGCGGAGCTTTCAATCACAGCGGCAT
+
BBBFFAFAIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
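To make the quality line a little more concrete, here is a minimal sketch of decoding it, assuming the sample uses Phred+33 (each score is the character's ASCII code minus 33). The sub name `phred33-scores` is made up for illustration and is not part of the script above.

```perl6
# Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
# "phred33-scores" is a hypothetical helper name.
sub phred33-scores(Str $qual) {
    $qual.comb.map(*.ord - 33).List
}

# The first ten quality characters from the sample record
my @scores = phred33-scores('BBBFFAFAII');
put @scores.join(' ');   # 33 33 33 37 37 32 37 32 40 40
```

For a Phred+64 file the same idea would apply with 64 in place of 33.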

What I thought would be fun to show off here is that you can read the contents of a list into more than one variable. Here I'd like to read four lines at a time, so I just read "lines" into four variables. How simple!
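The same trick works on any list, not just "lines". Here is a small sketch that runs without a file, using two invented records (the names `@lines` and `@fasta` and the data are purely illustrative):

```perl6
# Two made-up FASTQ records, flattened into one list of eight lines
my @lines = '@r1', 'ACGT', '+', 'IIII',
            '@r2', 'TTGA', '+', 'FFFF';

my @fasta;
# Each pass through the loop takes the next four elements
for @lines -> $header, $seq, $plus, $qual {
    @fasta.push: '>' ~ $header.substr(1), $seq;
}

.put for @fasta;
```

One caveat, as far as I know: the block expects exactly four values on every pass, so a list whose length is not a multiple of four (a truncated file, say) should fail at the last, incomplete group rather than silently skip it.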

2 Comments

Cool!

FWIW, you don't have to name variables that you don't use (but that you do want to "consume") in the for loop. So:

for $fastq.IO.lines -> $header, $seq, $, $ { ... }

is also perfectly valid.

About Ken Youens-Clark

I work for Dr. Bonnie Hurwitz at the University of Arizona where I use Perl quite a bit in bioinformatics and metagenomics. I am also trying to write a book at https://www.gitbook.com/book/kyclark/metagenomics/details. Comments welcome.