Of Go, C, Perl and fastq file conversion Vol III : pledging allegiance to the flag

In the previous entry, we presented a regex based fastq parser. As the regex engine maps to a finite state automaton, we should be able to rewrite the parser without regular expressions, using flags to keep track of the part of the record we are at. The code is straightforward, but lacks the elegance of the regex parser. It is provided as a possibly faster alternative to the regular expression implementation.

use v5.38;

sub fastq2a_using_flag ( $input_file, $output_file ) {
my ( $has_seqid, $has_seq, $has_meta, $has_qual ) = ( 0, 0, 0, 0 );
my ( $id, $seq, $meta, $qual ) = ( '', '', '', '' );
my $is_badly_formed_fastq = 0;
my ( $out_fh, $in_fh );
local $/ = "\n";

unless ( open $in_fh, '<', $input_file ) {
warn "Can't open $input_file: $!";
return 0; ## signify failure
}

unless ( open $out_fh, '>', $output_file ) {
warn "Can't open $output_file: $!";
return 0; ## signify failure
}
while (<$in_fh>) {
chomp;
my $line_header = substr( $_, 0, 1 );
unless ($has_seqid) {
if ( $line_header eq '@' ) {
$id = substr( $_, 1 );
$has_seqid = 1;
}
else {
$is_badly_formed_fastq = 1;
last;
}
}
else {
if ( $line_header ne '+' && !$has_meta ) {
$has_seq = 1;
$seq .= $_;
}
elsif ( $line_header eq '+' && !$has_meta && $has_seq ) {
$has_meta = 1;
$meta = substr( $_, 1 );
}
elsif ( $has_meta && !$has_qual ) {
$qual .= $_;
$has_qual = 1;
}
elsif ( $has_seq && $has_seqid && $has_meta && $has_qual ) {
if ( length($seq) == length($qual) && $line_header eq '@' ) {
print {$out_fh} ">$id\n$seq\n";
( $has_seqid, $has_seq, $has_meta, $has_qual ) =
( 0, 0, 0, 0 );
( $id, $seq, $meta, $qual ) = ( '', '', '', '' );
redo;
}
else {
$qual .= $_;
}
}
else {
$is_badly_formed_fastq = 1;
last;
}
}

}

my $retval;
if ($is_badly_formed_fastq) {
warn "bad fastq format\n";

#unlink $output_file if -e $output_file;
$retval = 0; ## signify failure
}
else {
print {$out_fh} ">$id\n$seq\n";
$retval = 1; ## signify success
}
close $in_fh;
close $out_fh;
return $retval;
}

## read two filenames from command line
my ( $input_file, $output_file ) = @ARGV[ 0, 1 ];
fastq2a_using_flag( $input_file, $output_file );

The file  test1.fastq can be used to test whether the two parsers can deal with all possible choices (they do!). In the final post we will test the performance of the four parsers: seqtk (C) seqkit (Go), the regex and the flag based parsers

1 Comment

some of the code is chopped off for me (using Firefox on desktop).

hack to fix:
.entry-body div[style] {
  overflow: scroll;
}

Leave a comment

About chrisarg

user-pic I like to use Perl for material other than text.