Of Go, C, Perl and fastq file conversion Vol II : the Jedi regex

By chrisarg on September 8, 2023 5:01 AM under bioinformatics, fasta, fastq, file parsing, parser, perl, sequencing

In the second part of this series about fast parsers for sequencing applications, I will review the code for the regex based parser. This is shown below (I use v5.38, as you should! because the year is 2023 and you should not have to type use strict; use warnings)

#!/home/chrisarg/perl5/perlbrew/perls/current/bin/perl
use v5.38;

sub fastq2a_using_regex ( $input_file, $output_file ) {
    my $is_badly_formed_fastq = 0;
    my ( $out_fh, $in_fh );
    ## change record separator to \n@
    local $/ = "\n@";
    unless ( open $in_fh, '<', $input_file ) {
        warn "Can't open $input_file: $!";
        return 0;    ## signify failure
    }
    ## open output output_file
    unless ( open $out_fh, '>', $output_file ) {
        warn "Can't open $output_file: $!";
        return 0;    ## signify failure
    }

    my $fasta_record = qr/
        ^@?(\N*$)\n        ## identifier
        (^[^\+@]*)         ## sequence, do not match meta header
        ^\+(\N*)\n         ## meta(data)    
        (.+)\Z             ## qual(ity) scores
    /smx;

    my $carryover = '';
    while (<$in_fh>) {
        chomp;
        my $copy_of_line = $_;
        $_ = $carryover . $_ if $is_badly_formed_fastq;
        my ( $id, $seq, $meta, $qual ) = m/$fasta_record/;
        unless ( $id && $seq && $qual ) {
            $is_badly_formed_fastq = 1;
            $carryover .= "$copy_of_line\n@";
            next;
        }
        $seq  =~ s/\n//g;    ## remove newlines from sequence
        $qual =~ s/\n//g;    ## remove newlines from qual scores
        unless ( length($seq) == length($qual) ) {
            $is_badly_formed_fastq = 1;
            $carryover .= "$copy_of_line\n@";
            next;
        }
        say {$out_fh} ">$id\n$seq";
        $is_badly_formed_fastq = 0;
        $carryover             = '';
    }
    close $in_fh;
    close $out_fh;
    if ($is_badly_formed_fastq) {
        warn "bad fastq format\n";
        unlink $output_file if -e $output_file;
        return 0;    ## signify failure
    }
    1;               ## signify success
}

## read two filenames  from command line
my ( $input_file, $output_file ) = @ARGV[ 0, 1 ];
fastq2a_using_regex( $input_file, $output_file );

The logic is rather straightforward:

Change the record separator from "\n" to "\n@", so that one chunks the files into records (purists may object to the use of local variables here, but the code chunk will be reused in another block context, and I really, really prefer not to forget setting the scope to local for this variable)
Use a regex to tease the record apart into the four fields
Do a rudimentary check to ensure that all four fields were recovered by the regex, and that the length of the quality string is equal to that of the sequence
If the check failed, then it is likely that one run into a "@" at the beginning of the sequence quality field, so set a flag and keep reading lines until we get the size of this field to equal the size of the sequence field. When this happens, reset the quality flag and resume the usual execution logic. This occurrence should be rare, as quality symbols over the value of ASCII 64 (= '@') are rare, and those at the beginning of the sequence are even more rare, but it ensures that we don't break the code
Write the id and the sequence
Rinse and repeat until the entire file has been processed and return success :)

The multi-line regex part has four components :

the identifier matches a '@' at the beginning of the field and captures all new line characters until the end ($) of the first line
the sequence matches all characters including new-lines until the regex engine
the metadata field, demarcated by a '+' at the beginning of the field, and a bunch of optional non-newline characters until the of that line
the quality field. While newline characters are not part of the range, the quality can extend across multiple lines. The capture of a potentially multi-lined field is effected by matching any character (including a newline!) by the /s flag in the regex.

Note that I didn't bother checking the second constraint in the file record, i.e. that the content of field 3, if present, be identical to that of field 1, as most sequencers leave this field empty to begin with. Also other checks such that the symbols in the quality and sequence field belong to the admissible set are not carried out, as existing tools don't bother with these checks which happen anyway in downstream steps of the bioinformatics data flows.

In contrast to popular and misguided opinion, the regex is not line-noisy and the use of the /x flag makes for a rather readable regex code segment. The price of elegance is slowness relative to compiled languages, and the reason is the use of regex logic. However the loss of speed is minimal as we will show in the last part of this series

2 comments

Tagged as:

Bioinformatics, fasta, fastq, parsing

2 Comments

Matthew Persico | September 15, 2023 6:44 PM | Reply

The text "In contrast to popular and misguided opinion" is underlined but it does not resolve to a link. Is this intended?

chrisarg | September 16, 2023 5:44 AM | Reply

Yes, it is intended to give emphasis :)

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About chrisarg

I like to use Perl for material other than text.

More info »

chrisarg