Reading sequences from FASTA format alignment by Bio::Perl

Show code for TL;DR:

use Bio::AlignIO;

my $in = Bio::AlignIO->new(

       -file   => $inputfilename ,

       -format => 'fasta',

     );

while ( my $aln = $in->next_aln() ) { 

    foreach my $seq ($aln->each_seq) {

        ...; # do some thing 

    }

}

Bio::Perl is a famous library for managing bioinformatics data, although all knowledge mentioned here can fetch from documents on cpan or Bio::Perl webpages, but I believe there can be numerous articles to explain the same thing. :)

Basic concept

The most difficult aspect for me is understanding which classes' documentation should I read, which object will be generated? and then, which method on which object do I need to invoke for the data I wanted?

so let me descript how a Bio::AlignIO object formed.

Weesperplein.svg

  1. Bio::AlignIO object can contain multiple alignments.

  2. Bio::SimpleAlign contains one alignment.

  3. Bio::LocatableSeq contains one aligned sequence.

For example, let's presume I have a file containing COI alignment for ten samples. If I use this file to call Bio::AlignIO->new(), the new Bio::AlignIO object will contain only one Bio::SimpleAlign object which contains this COI alignment. Then, the Bio::SimpleAlign object contains ten Bio::LocatableSeq objects for aligned COI sequences of each sample.

one alignment is tranditional one gene or multiple genes alignment, e.g., a COI alignment in here. Usually this type of alignment data is stored one file like phylip, nexus or fasta. but Bio::AlignIO can manage blast output, too. so Authors keep the ability of contain multiple alignments in one Bio::AlignIO object. In my condition, I just directly get first Bio::SimpleAlign from it.

Extract sequences

Today I just wanted to retrieve the sequences of each sample in my alignment file (fasta) and import them into Perl. To achieve this, first, we need to call the new method to create a new Bio::AlignIO object, using the alignment file as an argument:

my $in = Bio::AlignIO->new(
       -file   => $inputfilename,
       -format => 'fasta',
     );

Then, we want to extract Bio::SimpleAlign objects from the Bio::AlignIO object. For convenience, we use the same method as described in the Bio::AlignIO documentation. We use a while loop to iterate through each Bio::SimpleAlign object contained in the Bio::AlignIO object (although we know there is only one):

while ( my $aln = $in->next_aln() ) { ... ;}

Now, we need to extract Bio::LocatableSeq objects from Bio::SimpleAlign. Calling the each_seq() method of Bio::SimpleAlign allows us to retrieve Bio::LocatableSeq objects, each storing a sequence:

foreach my $seq ($aln->each_seq) { ...; }

At this point, we're nearly done. Simply by calling the seq() and id() methods of Bio::LocatableSeq objects, we can obtain the sequence ID and the actual sequence data. For instance, to print out the base pairs of each sequence:

foreach my $seq ($aln->each_seq) {print $seq->seq(), "\n"}

References:

  1. Bio::AlignIO
  2. Bio::SimpleAlign
  3. Bio::LocatableSeq
  4. https://en.wikipedia.org/wiki/Sequence_alignment

Leave a comment

About glycine

user-pic I blog about Perl.