Using Perl to prepare sequencing files to submit to NCBI's GEO

By chrisarg on January 22, 2023 1:12 AM under bioinformatics

In the middle of a manuscript submission that requires sequencing data to be uploaded to NCBI's Gene Expression Omnibus.
This is a fairly standardized and (painful!) process that requires one to assemble their sequencing data (a collection of hundreds or thousands of files in the FASTQ format), put them in a single (very large) folder, compress them, generate md5 hashes and then upload them to GEO's FTP site.
There are a couple of tutorials available e.g. here and there that mostly cover the use case of one having assembled the files into a single fastq.
Our project used Oxford's Nanopore platform which store's its data as a series of fastq files, each holding a user defined number of sequences (in our case 2,000). Some of the experiments had generated an excess of 10M reads, so we are talking about a serious number of files to process:

uncompress (if compressed)

concatanate

compress

hash using md5sum

I am extremely lazy to do repetitive shell typing, so I figure to use PERL to package the steps together. One needs first to create a list of directories that have the sequencing files (by default nanopore stores them in a directory that contains two subdirectories, called "fastq_pass" and "fastq_fail", with the pass and fail referring to files with different sequencing qualities.
I hardwired the name of that CSV in the script (but you could easily modify to use the commandline to get an arbitrary file), as well as the name of the output directory (which must be created beforehand).
The repetitive steps are now taken care by this PERL script (which is boring, but useful).

#!/home/chrisarg/perl5/perlbrew/perls/current/bin/perl
## compress files for NCBI
use v5.36;
use experimental qw(signatures);
use strict;
use Text::CSV qw( csv );
use File::Spec ;
use File::Basename;
use File::Copy;

use FindBin qw($Bin);

my $out_directory = File::Spec->catdir(dirname($Bin),'chrisarg_NCBI_submission');
my $analyses_files =
    csv(in => File::Spec->catfile($Bin,"package_PALS_NS_2.csv"),
            sep_char =>',',headers => "auto")
        or die Text::CSV->error_diag;
my @fastq_directories = map {
    File::Spec->catdir(dirname($Bin),values %{ $_})
    } @$analyses_files;
my @fastq_output_names = map {
    basename(values %{ $_}).".fastq";
    } @$analyses_files;

## directories go here
my (@files, @gz, @fastq,@dir_of_files,@dest_files, $output_fh);
while (my($index,$current_experiment_dir) = each @fastq_directories){
    @dir_of_files = (File::Spec->catdir($current_experiment_dir,'fastq_pass'),
        File::Spec->catdir($current_experiment_dir,'fastq_fail'),
    );
    for my $current_dir (@dir_of_files){
        say "Processing $current_dir";
        chdir $current_dir or die "can't cd into $current_dir";
        @gz = glob('*.gz');
        @fastq = glob('*.fastq');
        if (scalar @gz > 1){
            @dest_files = map {
                File::Spec->catfile($out_directory,$_)
            } @gz;
            foreach (0..$#dest_files){
                copy($gz[$_],$dest_files[$_])
                    or die "Can't copy $gz[$_] because of $!";
                system 'gunzip', $dest_files[$_];
            }
        }
        ## may have to deal with mix of gzpd and not files
        if (scalar @fastq > 1){
            @dest_files = map {
                File::Spec->catfile($out_directory,$_)
            } @fastq;
            foreach (0..$#dest_files){
                copy($fastq[$_],$dest_files[$_])
                    or die "Can't copy $gz[$_] because of $!";
            }
        }
    }
    chdir $out_directory;
    @dest_files = glob('*.fastq');
    open $output_fh, '>',$fastq_output_names[$index]
        or die "failed at $fastq_output_names[$index] due to $!";
    foreach my $input_file (@dest_files){
        open my $input_fh , '<', $input_file
            or die "can't open $input_file for reading";
        print $output_fh $_ while <$input_fh>;
    }
    close $output_fh;
    unlink @dest_files;
   !system 'gzip', '-9', $fastq_output_names[$index]
    or die 'something went wrong';
}

chdir $out_directory;
system "md5sum * > md5sum.txt";

2 comments

Tagged as:

Bioinformatics, GEO, NCBI, sequencing

2 Comments

Saif | January 23, 2023 12:43 AM | Reply

Looks very useful and time saving, well done ... is it worth submitting this as a module perhaps to BioPerl?

The Mighty Buzzard replied to comment from Saif | January 25, 2023 12:24 PM | Reply

No reason not to but I'd personally write this as a shell script. There's no text processing being done to speak of, so it would make more sense to me to write it as a much more legible bash script of about 1/4 the size.

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About chrisarg

I like to use Perl for material other than text.

More info »

chrisarg