September 2023 Archives
Of Go, C, Perl and fastq file conversion Vol IV : gone in 60 seconds (or less)
In the final part of this series, we will test the performance of the four parsers, in a scenario emulating the batch analysis of sequencing data. We will use the sample fastq file 3_OHara_S2_rbcLa_2019_minq7.fastq from https://zenodo.org/record/3736457. This is a 35MB file of 21791 long sequences for a nanopore experiment. Download the data and save them to a directory in your hard disk. Then use the following bash time_fastq2a_shell.txt (change the extension to .sh before running!) to process this file 500…
Of Go, C, Perl and fastq file conversion Vol III : pledging allegiance to the flag
Of Go, C, Perl and fastq file conversion Vol II : the Jedi regex
In the second part of this series about fast parsers for sequencing applications, I will review the code for the regex based parser. This is shown below (I use v5.38, as you should! because the year is 2023 and you should not have to type use strict; use warnings)
Of Go, C, Perl and fastq file conversion Vol I : intro
Next Generation Sequencing
(NGS) has really taken off the last few years, as both devices and the
cost of experiments have dramatically declined. NGS decipher the
identity (base composition, the sequence of letters in the alphabet of
DNA and RNA) of nucleic acids and return the results in the fastq open data format. Fastq
files are flat text files with a standardized layout: each molecule
present in the sample that is captured by the sequencer is represented
with four fields:
- a '@' character and is followed by a sequence identifier and an optional description
- one (typically) or more lines of characters in the four letter alphabet of nucleic acids
- a metadata field starting with the "+" optionally followed by the same sequence identifier and description as in the first field
- one, or more lines of the quality of each symbol sequence reported in field 2
A typical sequencing experiment will generate many (potentially hundreds) of gigabytes of information in fastq files, and a typical task in bioinformatics is to strip all other information from fastq files, except the identifier and the sequence itself (the first 2 fields). The resulting file format is known as fasta, and these files are then used for downstream tasks, e.g. building genomes, counting gene products etc. Various command line tools are used in bioinformatic workflows e.g. nexflow to facilitate this task. Two popular tools in common use are seqtk written in C, and seqkit written in Go. A rather quick search of MetaCPAN reveals numerous (perhaps even TOO numerous to count!) modules that offer fastq parsing functionality, but a lightweight converter such as the one implemented by seqtk and seqkit. I thus decided to take Perl for a ride and pit it against these two tools. The major challenges to be addressed by the Perl implementation are as the following:
- support multi-line fields for the sequence and the quality entries.The challenge arises because one
- can't
parse each record as 4 consecutive lines, simply checking that the
first line has a header of '@', the third one has one of '+' so that the
file splits neatly to sets of four fields.
- one can't simply break records at a string beginning of '@' (ASCII code 64), as the latter is in the character set of the quality symbols of the fastq records (ASCII 33 to 126)
- carry out basic sanity checks on the parsed records:
- the length of the sequence string is equal to the length of the quality string.
- (optionally)
that the contents of the third field (the one beginning with '+') is
equal to the identifier and description field if not a null.
- (optionally) that the contents of the sequence field only admit the symbols in the FASTA format specification and that the characters of the quality field are in the ASCII range 33-126