September 2023 Archives

Hash of Arrays Deathmatch : Native Perl vs. DBM::Deep vs. Redis

Sometimes one has to make compromises between speed of execution and memory, other times one may not have to be. While working towards a fairly (at least in my mind) complete solution to map Nanopore Sequencing files, I ran against the need to create and access fairly large hash of arrays (think of 1M - 100Mof keys), with each array itself consisting of a a fixed number of elements.
The hash of arrays is a fairly straightforward and fast data structure to create in Perl, the memory overhead can be substantial as the number of keys and values scale upwards.…

Of Go, C, Perl and fastq file conversion Vol IV : gone in 60 seconds (or less)

In the final part of this series, we will test the performance of the four parsers, in a scenario emulating the batch analysis of sequencing data. We will use the sample fastq file 3_OHara_S2_rbcLa_2019_minq7.fastq from https://zenodo.org/record/3736457. This is a 35MB file of 21791 long sequences for a nanopore experiment. Download the data and save them to a directory in your hard disk. Then use the following bash time_fastq2a_shell.txt (change the extension to .sh before running!) to process this file 500…

Of Go, C, Perl and fastq file conversion Vol III : pledging allegiance to the flag

In the previous entry, we presented a regex based fastq parser. As the regex engine maps to a finite state automaton, we should be able to rewrite the parser without regular expressions, using flags to keep track of the part of the record we are at. The code is straightforward, but lacks the elegance of the regex parser. It is provided as a possibly faster alternative to the regular expression implementation.

="color:…

Of Go, C, Perl and fastq file conversion Vol II : the Jedi regex

In the second part of this series about fast parsers for sequencing applications,  I will review the code for the regex based parser. This is shown below (I use v5.38, as you should! because the year is 2023 and you should not have to type use strict; use warnings)

="color:…

Of Go, C, Perl and fastq file conversion Vol I : intro

Next Generation Sequencing (NGS) has really taken off the last few years, as both devices and the cost of experiments have dramatically declined. NGS decipher the identity (base composition, the sequence of letters in the alphabet of DNA and RNA) of nucleic acids and return the results in the fastq open data format. Fastq files are flat text files with a standardized layout: each molecule present in the sample that is captured by the sequencer is represented with four fields:

  1. a '@' character and is followed by a sequence identifier and an optional description

  2. one (typically) or more lines of characters in the four letter alphabet of nucleic acids

  3. a metadata field starting with the "+" optionally followed by the same sequence identifier and description as in the first field

  4. one, or more lines of the quality of each symbol sequence reported in field 2

An example of such a four field entry may look something like this

A typical sequencing experiment will generate many (potentially hundreds) of gigabytes of information in fastq files, and a typical task in bioinformatics is to strip all other information from fastq files, except the identifier and the sequence itself (the first 2 fields). The resulting file format is known as fasta, and these files are then used for downstream tasks, e.g. building genomes, counting gene products etc. Various command line tools are used in bioinformatic workflows e.g. nexflow to facilitate this task. Two popular tools in common use are seqtk written in C, and seqkit written in Go. A rather quick search of MetaCPAN reveals numerous (perhaps even TOO numerous to count!) modules that offer fastq parsing functionality, but a lightweight converter such as the one implemented by seqtk and seqkit. I thus decided to take Perl for a ride and pit it against these two tools. The major challenges to  be addressed by the Perl implementation are as the following:

  1. support multi-line fields for the sequence and the quality entries.The challenge arises because one 
    • can't parse each record as 4 consecutive lines, simply checking that the first line has a header of '@', the third one has one of '+' so that the file splits neatly to sets of four fields.
    • one can't simply break records at a string beginning of '@' (ASCII code 64), as the latter is in the character set of the quality symbols of the fastq records (ASCII 33 to 126)
  2. carry out basic sanity checks on the parsed records:
    • the length of the sequence string is equal to the length of the quality string.
    • (optionally) that the contents of the third field (the one beginning with '+') is equal to the identifier and description field if not a null.
    • (optionally) that the contents of the sequence field only admit the symbols in the FASTA format specification and that the characters of the quality field are in the ASCII range 33-126

About chrisarg

user-pic I like to use Perl for material other than text.