In the second part of this series about fast parsers for sequencing applications, I will review the code for the regex based parser. This is shown below (I use v5.38, as you should! because the year is 2023 and you should not have to type use strict; use warnings)
Next Generation Sequencing
(NGS) has really taken off the last few years, as both devices and the
cost of experiments have dramatically declined. NGS decipher the
identity (base composition, the sequence of letters in the alphabet of
DNA and RNA) of nucleic acids and return the results in the fastq open data format. Fastq
files are flat text files with a standardized layout: each molecule
present in the sample that is captured by the sequencer is represented
with four fields:
- a '@' character and is followed by a sequence identifier and an optional description
- one (typically) or more lines of characters in the four letter alphabet of nucleic acids
- a metadata field starting with the "+" optionally followed by the same sequence identifier and description as in the first field
- one, or more lines of the quality of each symbol sequence reported in field 2
A typical sequencing experiment will generate many (potentially hundreds) of gigabytes of information in fastq files, and a typical task in bioinformatics is to strip all other information from fastq files, except the identifier and the sequence itself (the first 2 fields). The resulting file format is known as fasta, and these files are then used for downstream tasks, e.g. building genomes, counting gene products etc. Various command line tools are used in bioinformatic workflows e.g. nexflow to facilitate this task. Two popular tools in common use are seqtk written in C, and seqkit written in Go. A rather quick search of MetaCPAN reveals numerous (perhaps even TOO numerous to count!) modules that offer fastq parsing functionality, but a lightweight converter such as the one implemented by seqtk and seqkit. I thus decided to take Perl for a ride and pit it against these two tools. The major challenges to be addressed by the Perl implementation are as the following:
- support multi-line fields for the sequence and the quality entries.The challenge arises because one
parse each record as 4 consecutive lines, simply checking that the
first line has a header of '@', the third one has one of '+' so that the
file splits neatly to sets of four fields.
- one can't simply break records at a string beginning of '@' (ASCII code 64), as the latter is in the character set of the quality symbols of the fastq records (ASCII 33 to 126)
- the length of the sequence string is equal to the length of the quality string.
that the contents of the third field (the one beginning with '+') is
equal to the identifier and description field if not a null.
- (optionally) that the contents of the sequence field only admit the symbols in the FASTA format specification and that the characters of the quality field are in the ASCII range 33-126