Reading sequences from FASTA format alignment by Bio::Perl
Show code for TL;DR:
use Bio::AlignIO;
my $in = Bio::AlignIO->new(
-file => $inputfilename ,
-format => 'fasta',
);
while ( my $aln = $in->next_aln() ) {
foreach my $seq ($aln->each_seq) {
...; # do some thing
}
}
Bio::Perl is a famous library for managing bioinformatics data, although all knowledge mentioned here can fetch from documents on cpan or Bio::Perl webpages, but I believe there can be numerous articles to explain the same thing. :)
Basic concept
The most difficult aspect for me is understanding which classes' documentation should I read, which object will be generated? and then, which method on which object do I need to invoke for the data I wanted?
so let me descript how a Bio::AlignIO object formed.
Bio::AlignIO
object can contain multiple alignments.Bio::SimpleAlign
contains one alignment.Bio::LocatableSeq
contains one aligned sequence.
For example, let's presume I have a file containing COI alignment for ten samples. If I use this file to call Bio::AlignIO->new()
, the new Bio::AlignIO
object will contain only one Bio::SimpleAlign
object which contains this COI alignment. Then, the Bio::SimpleAlign
object contains ten Bio::LocatableSeq
objects for aligned COI sequences of each sample.
one alignment is tranditional one gene or multiple genes alignment, e.g., a COI alignment in here. Usually this type of alignment data is stored one file like phylip, nexus or fasta. but Bio::AlignIO
can manage blast output, too. so Authors keep the ability of contain multiple alignments in one Bio::AlignIO
object. In my condition, I just directly get first Bio::SimpleAlign
from it.
Extract sequences
Today I just wanted to retrieve the sequences of each sample in my alignment file (fasta) and import them into Perl. To achieve this, first, we need to call the new method to create a new Bio::AlignIO
object, using the alignment file as an argument:
my $in = Bio::AlignIO->new(
-file => $inputfilename,
-format => 'fasta',
);
Then, we want to extract Bio::SimpleAlign
objects from the Bio::AlignIO
object. For convenience, we use the same method as described in the Bio::AlignIO
documentation. We use a while loop to iterate through each Bio::SimpleAlign
object contained in the Bio::AlignIO
object (although we know there is only one):
while ( my $aln = $in->next_aln() ) { ... ;}
Now, we need to extract Bio::LocatableSeq
objects from Bio::SimpleAlign
. Calling the each_seq()
method of Bio::SimpleAlign
allows us to retrieve Bio::LocatableSeq
objects, each storing a sequence:
foreach my $seq ($aln->each_seq) { ...; }
At this point, we're nearly done. Simply by calling the seq()
and id()
methods of Bio::LocatableSeq
objects, we can obtain the sequence ID and the actual sequence data. For instance, to print out the base pairs of each sequence:
foreach my $seq ($aln->each_seq) {print $seq->seq(), "\n"}
References:
Leave a comment