Of Go, C, Perl and fastq file conversion Vol IV : gone in 60 seconds (or less)

By chrisarg on September 16, 2023 5:30 AM

In the final part of this series, we will test the performance of the four parsers in a scenario emulating the batch analysis of sequencing data. We will use the sample fastq file 3_OHara_S2_rbcLa_2019_minq7.fastq from https://zenodo.org/record/3736457. This is a 35MB file of 21791 long sequences from a nanopore experiment. Download the data and save them to a directory on your hard disk. Then use the bash script time_fastq2a_shell.txt (change the extension to .sh before running!) to process this file 500 times with each of the four methods: seqtk (C), seqkit (Go), perl - regex (code presented here, saved into the fastq2a_regex_comp_in_func.pl file) and perl - flag (code presented there, saved into the fastq2a_flag.pl file).

Issue the following from the command line to store the timings as a tab separated file:

./time_fastq2a_shell.sh 2> timing.txt && perl -e 'open(my $fh,"<","timing.txt"); $count = 0; while(<$fh>) {chomp;next unless /real/; $count++;/(\d+\.?\d*)m(\d+\.?\d*)s/;print(($count<=500) ? "seqtk" : ($count<=1000) ? "seqkit" : ($count<=1500) ? "regex" : "flag" ,qq(\t),$1*60+$2,"\n")}' > timing_sec.txt
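
The one-liner above is rather dense, so here is a more readable sketch of the same labelling step in shell and awk. The function name label_timings and its arguments are hypothetical (they do not come from the timing script), and the sketch assumes bash-style `time` output such as "real 0m0.086s" and that each tool was run the same number of times, one tool after the other.

```shell
# Hypothetical helper: label_timings RUNS TOOL1 TOOL2 ...
# Reads `time` output on stdin, keeps only the "real" lines and prints
# "<tool><TAB><seconds>", assuming every tool was run RUNS times in a row.
label_timings() {
  local runs=$1
  shift
  awk -v runs="$runs" -v tools="$*" '
    BEGIN { n = split(tools, name, " ") }
    /^real/ {
      # $2 looks like 0m0.086s: minutes before the "m", seconds before the "s"
      sub(/s$/, "", $2)
      split($2, t, "m")
      idx = int(count / runs) + 1   # 1-based index of the current tool
      if (idx > n) idx = n          # assumption: surplus lines go to the last tool
      printf "%s\t%.3f\n", name[idx], t[1] * 60 + t[2]
      count++
    }'
}
```

Under these assumptions, something like `label_timings 500 seqtk seqkit regex flag < timing.txt > timing_sec.txt` should give the same kind of tab separated file. Deriving the tool from the integer division of the run counter means there are no hand-written range boundaries to get wrong.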

The statistics of the 500 runs per method (in seconds) are shown below:

 library       Mean         SD
   seqtk 0.08651800 0.05804526
  seqkit 0.10549900 0.02889980
    flag 0.09279242 0.01132046
   regex 0.12050400 0.01370685

and a boxplot is used to show the variability of individual runs:

It should not be surprising that the compiled C library achieves the best (but most variable) performance, followed closely by the flag-based perl version (slower by 7%), the Go library (slower by 22% relative to C) and finally the regex perl version (slower by 39% relative to C). The standard deviation of the Perl parsers' run times was roughly 2 to 5 times smaller than that of the compiled tools. One of the benefits of the Perl parsers is that they don't have to output to a file, but can deliver the results directly into the namespaces of other perl applications that perhaps use BioPerl, thus reducing the downstream IO overhead of moving results around.
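
For readers who want to recompute the summary figures from their own timing_sec.txt, here is a minimal sketch in shell and awk. The post does not say how the table was produced, so this is only one way to do it: the function names summarise_timings and pct_slower are hypothetical, and the sample standard deviation (n-1 denominator) is an assumption.

```shell
# Hypothetical helper: reads "<tool><TAB><seconds>" lines on stdin and prints
# tool, mean and sample SD (n-1 denominator, an assumption) per tool.
summarise_timings() {
  awk -F'\t' '
    { n[$1]++; sum[$1] += $2; sumsq[$1] += $2 * $2 }
    END {
      for (tool in n) {
        mean = sum[tool] / n[tool]
        var = 0
        if (n[tool] > 1)
          var = (sumsq[tool] - n[tool] * mean * mean) / (n[tool] - 1)
        if (var < 0) var = 0        # guard against floating point round-off
        printf "%s\t%.6f\t%.6f\n", tool, mean, sqrt(var)
      }
    }' | LC_ALL=C sort
}

# Hypothetical helper: pct_slower BASE OTHER
# Prints how much slower OTHER is than BASE, in percent.
pct_slower() {
  awk -v base="$1" -v other="$2" \
    'BEGIN { printf "%.1f\n", (other / base - 1) * 100 }'
}
```

With the means from the table, `pct_slower 0.08651800 0.09279242` prints 7.3 and `pct_slower 0.08651800 0.10549900` prints 21.9, which matches the rounded percentages quoted above. Running `summarise_timings < timing_sec.txt` on your own machine will of course give different numbers.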