FASTQ read-pairer

By Ken Youens-Clark on October 31, 2016 4:50 PM

In bioinformatics, we get data from the sequencing centers usually in BAM or FASTQ format. One of the first steps is to QC the data to remove any reads that have very low quality or which are too short to use. Some sequencing technologies read from both the beginning and the end of the sequence in opposite directions, and so you need to put the reads together. Quite often one of the directional reads (forward or "R1", backward or "R2") are lost to QC, and so those are considered "singletons" that can't be paired with their read mate.

The boss wanted me to write a read pairer in Perl 6 for our class. Challenge accepted! Below is a decent working version, but it's dog slow ("dog" being Mississippi for "very"). Comments welcome!

#!/usr/bin/env perl6

class Fastq {

    has Str $.header;

    has Str $.seq;

    has Str $.qual;

    has IO::Handle $.fh;

    has Str $.name = "";

    has Int $.num  = 0;

    has Int $.dir  = 0;

    submethod BUILD (

        Str :$header,

        Str :$seq,

        Str :$qual,

        IO::Handle :$fh

    ) {

        $!header = $header;

        $!seq    = $seq;

        $!qual   = $qual;

        $!fh     = $fh;

        # isolate ID from header, take off "@"

        (my $id = $header) ~~ s/\s.*//;

        $id ~~ s/^ '@'//;

        # e.g., "SRR1647045.4.1" where 4 is read num, 1 is direction

        if my $match = $id ~~ /(.*) '.' (\d+) '.' (<[12]>)$/ {

            $!name = ~$match[0];

            $!num  = +$match[1];

            $!dir  = +$match[2];

        }

    }

    method Str {

        return join "\n", $.header, $.seq, '+', $.qual;

    }

}

# --------------------------------------------------

subset File of Str where *.IO.f;

sub MAIN (

    File :$r1!,

    File :$r2!,

    Str :$out-dir=$*SPEC.catdir($*CWD, 'out'),

    Bool :$debug=False

) {

    mkdir $out-dir unless $out-dir.IO.d;

    my $fh1_in = open $r1;

    my $fh2_in = open $r2;

    sub bugger (*@strs) { put @strs.join if $debug };

    # e.g., SRR1647045_1.trim.fastq => SRR1647045_R1.fastq

    (my $basename = $r1) ~~ s/'_1' .*//;

    my $r1_out    = $*SPEC.catfile(

                    $out-dir, $basename ~ '_R1.fastq');

    my $r2_out    = $*SPEC.catfile(

                    $out-dir, $basename ~ '_R2.fastq');

    my $sing_out  = $*SPEC.catfile(

                    $out-dir, $basename ~ '_singletons.fastq');

    my $r1_fh     = open $r1_out, :w;

    my $r2_fh     = open $r2_out, :w;

    my $sing_fh   = open $sing_out, :w;

    my $i = 0;

    loop {

        bugger("top o' the loop");

        my $read1 = read_fq($fh1_in);

        my $read2 = read_fq($fh2_in);

        last unless $read1 && $read2;

        bugger("read1 ({$read1.num}) read2 ({$read2.num})");

        given ($read1, $read2) {

            when (Fastq, Nil) { $sing_fh.put(~$read1) }

            when (Nil, Fastq) { $sing_fh.put(~$read2) }

            when (Fastq, Fastq) {

                if $read1.num == $read2.num {

                    $r1_fh.put(~$read1);

                    $r2_fh.put(~$read2);

                }

                else {

                    my ($low, $high) = ($read1, $read2).sort(*.num);

                    loop {

                        bugger("low ({$low.num}) high ({$high.num})");

                        if $low.num == $high.num {

                            my ($r1, $r2) = $low.dir == 1

                                ?? ($low, $high)

                                !! ($high, $low);

                            $r1_fh.put(~$r1);

                            $r2_fh.put(~$r2);

                            last;

                        }

                        else {

                            $sing_fh.put($low);

                            $low = read_fq($low.fh);

                        }

                    }

                }

            }

        }

    }

    put "Done.";

}

sub read_fq (IO::Handle $fh) {

    if $fh.eof {

        return Nil;

    }

    else {

        my ($header, $seq="", $="", $qual="") = $fh.lines;

        return Fastq.new(

            header => ~$header,

            seq    => ~$seq,

            qual   => ~$qual,

            fh     => $fh

        );

    }

}

2 comments

2 Comments

coke | October 31, 2016 7:49 PM | Reply

Do you have some sample input we could run against this with expected output?

Ken Youens-Clark | October 31, 2016 7:58 PM | Reply

Sorry, yes, here is the code and sample data:

https://github.com/kyclark/metagenomics-book/tree/master/perl6/fastq-read-pair

I should also have elaborated on the problem. Usually the R1/R2 files have each read's forward/reverse (respectively) mate in the same order/location. That is, in the raw FASTQ files, read 1 will be the first in R1/R2, then read 2, always in ascending order. But after QC, R1 might start on read #10 because the first nine forward reads were bad while R2 might start on read #2, then jump to read #7, etc. After QC, the pairing is off. Only when you find a mate pair (same read number) can you write them back into their respective R1/R2 out files; otherwise the unmated read goes into the singletons file.

Does that make sense?

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Ken Youens-Clark

I work for Dr. Bonnie Hurwitz at the University of Arizona where I use Perl quite a bit in bioinformatics and metagenomics. I am also trying to write a book at https://www.gitbook.com/book/kyclark/metagenomics/details. Comments welcome.

More info »

kyclark