What are BioPerl's weaknesses?

By pyrimidine on December 14, 2009 5:30 PM under BioPerl, Biology

(Note: this is a repost due to some of the beta-ness of the site).

BioPerl is a commonly used toolkit for bioinformatics, but it does have its share of problems. As educated_foo has pointed out, it can be slow, documentation is spotty for some modules, classes can be overly complex in some cases, and there are interface issues. For simple formats like FASTA, it's probably overkill if you are just parsing data. Then there's the monolithic nature of the beast; I think it's over 1000 modules now.

We in the BioPerl community haven't heard much about these issues from users. What (specifically) bugs you about BioPerl? What could be improved? In other words, what are its weaknesses? Specifics would be nice.

(Just for some balance, I'll blog about some of BioPerl's strengths at a later point)

Rattling off a couple, from educated_foo and myself:

Make it less monolithic. Break up 'core' into more manageable bits.
Simplify classes.
Better tests?
Maybe a BioPerl Manual? (we do have HOWTOs that could work as starter material...)
Moose? Some of those interfaces fit nicely into Roles...
Allow parsers to return simple data structures (hash refs?) as well as objects.
Re-evaluate interface? educated_foo mentioned problems, but specifics would be nice.
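
To make the "simple data structures as well as objects" idea concrete, here is a minimal sketch of a FASTA parser that can return either. It is written in Python only to keep it short and library-free; it is not BioPerl code, and `parse_fasta`, `SeqRecord`, and the `as_objects` flag are hypothetical names invented for illustration. A Perl version would return hash refs where this one returns dicts.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class SeqRecord:
    """Hypothetical stand-in for a full-featured sequence object."""
    id: str
    desc: str
    seq: str

    def length(self) -> int:
        return len(self.seq)


def parse_fasta(lines: Iterable[str],
                as_objects: bool = False) -> Iterator[Union[dict, SeqRecord]]:
    """Yield one record per FASTA entry, as a plain dict by default
    or as a SeqRecord when as_objects is True."""

    def build(header: str, chunks: list):
        # The ID is everything up to the first space; the rest is the description.
        ident, _, desc = header.partition(" ")
        record = {"id": ident, "desc": desc, "seq": "".join(chunks)}
        return SeqRecord(**record) if as_objects else record

    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield build(header, chunks)
```

Someone who only needs the sequence strings can loop over `parse_fasta(open("seqs.fa"))` and read `rec["seq"]`, while code that wants methods can ask for `as_objects=True`. The parsing logic is shared, so the cheap path and the object path cannot drift apart.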

Others?

7 comments

educated_foo | December 15, 2009 7:20 AM

I gave a detailed critique on reading PDB files on the previous entry. Maybe someone has a cached copy.

pyrimidine replied to comment from educated_foo | December 15, 2009 1:42 PM

That was about Bio::Structure::IO. I did manage to read it before it went bye-bye. I remember agreeing with pretty much everything you mentioned; I just can't recall the specifics (slowness and design were two, maybe?).

The problem (though it's not an excuse) is that that particular section of modules doesn't get nearly as much attention as the rest of BioPerl. We (core devs) unfortunately don't have time to refactor every part of BioPerl. I think de-monolithizing the core will help tremendously, as the Bio::Structure modules would very likely exist as their own entity within CPAN. In that case a significant refactor isn't out of the question, and it wouldn't hold up the release of the other modules; that kind of hold-up is one of the largest impediments to making a new core release at the moment.

educated_foo | December 15, 2009 6:06 PM

Over-engineered design and poor documentation were the main points. I think removing Bio::Structure from BioPerl would be the easiest solution. The loss of intended functionality would suck, but it's too poorly implemented to be useful now.

pyrimidine | December 15, 2009 6:38 PM

Overall I tend to agree with you, but both the lack of docs and the over-engineering really depend on the specific set of classes. In many cases there are implementations that have been simplified; in others there are simplifications that could be made but haven't been attempted for various reasons (including the issues with core being monolithic). Splitting things up will help. I appreciate the feedback and I'll definitely bring this up with the other core devs when I meet with them in January.

educated_foo | December 16, 2009 12:58 AM

Just to try to add something constructive here... A bioinformatician considering BioPerl typically has some text, a powerful text-processor (Perl), and a need to extract some information and do a simple computation. For him to use BioPerl, it needs to offer something compelling enough to stop him from just whipping out a regex and tweaking it until it's "good enough." In some cases, e.g. with FASTA files, this can be difficult, since the file format is simple and regular.

I think doing more task-based reviews might help the rest of BioPerl. Find a biologist who isn't a developer, and see what happens when he looks into using BioPerl for a specific task. In my case, I wanted to get the secondary structure labels off the main strand of a protein. Other examples include: running PSI-BLAST on a sequence of interest and extracting the top hits (or using one of the other NCBI programs); building a multiple alignment with e.g. CLUSTAL-W or BEAGLE; finding the open reading frames in a piece of RNA and translating them to amino acids; feeding plain old text (or a home-grown file format) into BioPerl; and turning BioPerl sequence objects into strings or files in that home-grown format.
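
As a rough illustration of how small the "whip it up yourself" baseline is for one of these tasks, here is a sketch of finding open reading frames in an RNA string and translating them. It is plain Python rather than Perl or BioPerl, it assumes the standard genetic code, and it scans only the three forward frames (no reverse complement, no nested ORFs). The names `CODON_TABLE`, `find_orfs`, and `translate` are made up for this example.

```python
# Standard genetic code, built from the conventional U/C/A/G ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO,
))


def find_orfs(rna: str) -> list:
    """Return (start, end, sequence) for each AUG...stop stretch
    in the three forward reading frames. end is exclusive."""
    rna = rna.upper().replace("T", "U")  # tolerate DNA-style input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orfs.append((start, i + 3, rna[start:i + 3]))
                start = None
    return sorted(orfs)


def translate(orf: str) -> str:
    """Translate an ORF to amino acids, stopping at the first stop codon.
    Codons with ambiguous bases become 'X'."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `find_orfs("GGAUGAAAUUUUAGCC")` finds a single ORF starting at index 2, and `translate` turns it into `MKF`. This is the bar a toolkit has to clear: the library version needs to be easier to find and use than twenty-odd lines like these, or handle the cases they ignore.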

It would be great to turn these recipes into a cookbook in the BioPerl distribution. I'm not doing much bio stuff at the moment, but I (and I think other BioPerl outsiders) would be happy to write up our attempts to use it on day-to-day tasks.

pyrimidine replied to comment from educated_foo | December 16, 2009 2:31 AM

All of those tasks are doable within BioPerl currently, so a decent 'cookbook' is probably the best way to go, starting with some of the above examples. These could be included in the distribution; we have quite a few scraps and examples on the wiki currently, but it would also be nice to access them via perldoc directly.

educated_foo | December 16, 2009 4:25 AM

(Oops! I meant T-coffee, not Beagle. Both are written in Java, but only T-coffee does multiple alignments. Beagle is a phasing program.)

The wiki looks good, but whether or not they're doable is beside the point. The relevant question is "what happens when an average biologist tries to do them?" Does he find the right solution, or get lost in the documentation and interface complexity, and give up?
