Closures, alternatives, map in Perl 6

In a script I recently wrote, I employed a few features of Perl 6 that I'd like to highlight. I'm using Mash to create a distance matrix of samples (usually metagenomes or genomes) to each other, either in a complete pair-wise fashion or some set (like the Pacific Ocean Virome) to some new set of samples.

The output of Mash is a tab-delimited file. The first line contains the column headers, and the first column is the literal string "#query" followed by all the sample/file names which include the relative path information. Here is a sample:

#query  fasta/L.Spr.C.1300m.fa  fasta/M.Fall.O.105m.fa  fasta/L.Win.O.10m.fa    fasta/SMS.Spr.C.5m.fa   fasta/M.Fall.I.10m.fa   fasta/L.Sum.O.2000m.fa  fasta/M.Fall.O.1000m.fa fasta/M.Fall.O.4300m.fa fasta/L.Spr.I.2000m.fa  fasta/L.Spr.C.1000m.fa  fasta/L.Spr.O.2000m.fa  fasta/M.Fall.C.10m.fa   fasta/L.Spr.O.10m.fa    fasta/SFC.Spr.C.5m.fa   fasta/SFS.Spr.C.5m.fa   fasta/L.Sum.O.500m.fa   fasta/L.Spr.I.10m.fa    fasta/L.Sum.O.10m.fa    fasta/L.Spr.I.500m.fa   fasta/L.Spr.C.10m.fa    fasta/GD.Spr.C.8m.fa    fasta/M.Fall.I.42m.fa   fasta/L.Win.O.1000m.fa  fasta/L.Win.O.500m.fa   fasta/M.Fall.O.10m.fa   fasta/SFD.Spr.C.5m.fa   fasta/L.Spr.O.1000m.fa  fasta/STC.Spr.C.5m.fa   fasta/L.Spr.I.1000m.fa  fasta/L.Sum.O.1000m.fa  fasta/L.Spr.C.500m.fa   fasta/L.Win.O.2000m.fa  fasta/GF.Spr.C.9m.fa

All the other lines are the reference samples (with relative path information to the index) followed by a value between 0 and 1 indicating the distance (shared genetic content as determined by k-mers in this case) between the samples where "0" indicates identical samples and "1" is completely unrelated:

./HOT/HOT226_1_0500m/HOT226_1_0500m.fastq   1   1   1   1   1   1   0.295981    1   1   1   1   1   1   1   1   0.295981    1   1   1   1   1   1   0.295981    1   1   1   1   1   1   1   0.295981    1   1

I need to clean up this file a bit to


  • remove the "#query" -- the comment character can cause problems for some downstream analysis tools, and the presence of anything in that field causes problems for others

  • use the basenames of the files, preferably without any suffixes like ".gz," ".fasta," ".fastq," etc., because certain visualization tools will use the sample names and they need to be as short as possible

  • allow for the the substitution of sample names from the user, i.e., a mapping from sample file names to some alias

  • remove lines where all the values are equal to "1" (there's no shared sequence space)

  • convert from a "distance" matrix to a "nearness" by subtracting the distance value from 1

Here's the script I wrote to do this:

#!/usr/bin/env perl6

subset File of Str where *.IO.f;

sub MAIN (
    File :$in!,
    Str  :$out="",
    Str  :$alias-file="",
    Bool :$nearness=False
) {
    my %alias;
    if $alias-file && $alias-file.IO.f {
        my $fh  = open $alias-file;
        my @hdr = $fh.get.split(/\t/); # name, alias -- not used
        %alias  = $fh.lines.map(*.split(/\t/)).flat;
    }

    sub clean-file-name($file) {
        my $basename = $file.IO.basename;
        $basename   ~~ s/'.' msh $//;
        $basename   ~~ s/'.' gz $//;
        $basename   ~~ s/'.' fn?a(st<[aq]>)? $//;
        %alias{$basename} || $basename;
    }

    my $out-file = $out || do {
        my $ext      = $in.IO.extension;
        my $basename = $in.IO.basename.subst(/'.' $ext $/, '');
        $*SPEC.catfile($in.IO.dirname, $basename ~ '.2.' ~ $ext);
    };

    my $in-fh  = open $in, :r;
    my $out-fh = open $out-file, :w;

    for $in-fh.lines.kv -> $i, $line {
        my @flds = $line.split(/\t/);

        if $i == 0 {
            my $query = @flds.shift; # literal "#query", not needed
            my @files = @flds.map(&clean-file-name);
            $out-fh.put(join("\t", flat "", @files));
        }
        else {
            my $file = clean-file-name(@flds.shift);
            next if all(@flds) == 1;
            if $nearness {
                @flds = @flds.map(1 - *);
            }
            $out-fh.put(join "\t", flat $file, @flds);
        }
    }

    $in-fh.close;
    $out-fh.close;

    put "Done, see '$out-file'";
}

In my last post, I wrote about the fabulous MAIN subroutine, so I won't belabor the arguments list. Suffice to say this is the USAGE:

Usage:
  ./process-dist.pl6 --in= [--out=] [--alias-file=] [--nearness (False True)]

I need an --in file argument and optionally an --out filename, and --alias-file name which has the file/name mappings, and a --nearness option to indicate that I want to reverse the numbers.

I will use a hash to handle the alias mapping, if present. I look to see if there is an $alias-file provided and if it is an actual file (.IO.f). If so, I open it, read off the first line of column names ("name," "alias"), and then grab all the lines which I map into a code block where I split the whatever/thing on tabs. Finally, I flatten all that (which is a list of 2-member lists) so to populate the %alias hash.

    my %alias;
    if $alias-file && $alias-file.IO.f {
        my $fh  = open $alias-file;
        my @hdr = $fh.get.split(/\t/); # name, alias -- not used
        %alias  = $fh.lines.map(*.split(/\t/)).flat;
    }

Next I create a simple closure around the %alias while encapsulating all the logic of how I want to clean up the file names. I can treat any string like an IO::Path object with the IO method so that I can use the basename routine to strip off the directory names. From there I'm using the ~~ operator to apply the s (in-place substitution) to remove a series of offensive suffixes. The third substitution will remove ".fa, ".fna", ".fasta," and ".fastq". Perl will return the last value of a subroutine automatically (like Haskell but unlike Javascript), so I just state the last line is the value of the alias for this (cleaned-up) filename or the cleaned-up name:

    sub clean-file-name($file) {
        my $basename = $file.IO.basename;
        $basename   ~~ s/'.' msh $//;
        $basename   ~~ s/'.' gz $//;
        $basename   ~~ s/'.' fn?a(st<[aq]>)? $//;
        %alias{$basename} || $basename;
    }

If the user does not supply an --out file name, then I need to construct one from the existing file plus a "2" before the extension. To do so, I get the basename and subst (substitute) the extension for nothing (I wish there were an easier way!). Then I use the global $*SPEC object's catfile routine to piece back together the path, basename, ".2", and extension for my new filename.

    my $out-file = $out || do {
        my $ext      = $in.IO.extension;
        my $basename = $in.IO.basename.subst(/'.' $ext $/, '');
        $*SPEC.catfile($in.IO.dirname, $basename ~ '.2.' ~ $ext);
    };

Now I'm ready to open my input and output files. The default is to open for "read" (:r). It's necessary to open the output for "write" with :w or bad things happen when you try to print things to that handle.

    my $in-fh  = open $in, :r;
    my $out-fh = open $out-file, :w;

In another post I write about using kv (key/value) on lists like that returned from "lines." The reason is that I need to know if I'm on the first line (well, zeroth). If so, I want to shift off the first member of the @flds to get rid of that pesky "#query" from Mash. Then I want to process all the rest of the lines with my aforementioned &clean-file-name routine. Note that I preface the subroutine name with the ampersand to indicate a reference and not an directive to execute it. Lastly, I put the new string made of an empty string and the cleaned up filenames (or aliases). It necessary to use flat because Perl 6 respects lists-of-lists whereas Perl 5 automatically flattened these structures:

        if $i == 0 {
            my $query = @flds.shift; # literal "#query", not needed
            my @files = @flds.map(&clean-file-name);
            $out-fh.put(join("\t", flat "", @files));
        }

If I'm not on the first line, then I need to clean up the first field and decide how to handle the rest of the values. If the user has indicated that I need to invert the distance matrix into a nearness matrix, I need to subtract the values from 1:

        else {
            my $file = clean-file-name(@flds.shift);
            next if all(@flds) == 1;
            if $nearness {
                @flds = @flds.map(1 - *);
            }
            $out-fh.put(join "\t", flat $file, @flds);
        }

Then I simply close the files and let the user know where to find the output file:

    $in-fh.close;
    $out-fh.close;

    put "Done, see '$out-file'";

Here's hoping you find something useful in this!

Leave a comment

About Ken Youens-Clark

user-pic I work for Dr. Bonnie Hurwitz at the University of Arizona where I use Perl quite a bit in bioinformatics and metagenomics. I am also trying to write a book at https://www.gitbook.com/book/kyclark/metagenomics/details. Comments welcome.