What are BioPerl's weaknesses?

(Note: this is a repost due to some of the beta-ness of the site).

BioPerl is a commonly used toolkit for bioinformatics, but it does have its share of problems. As educated_foo has pointed out, it can be slow, documentation is spotty for some modules, classes can be overly complex in some cases, and there are interface issues. And for simple formats like FASTA, it's probably overkill if you are just parsing data. Not to mention the monolithic nature of the beast; I think it's over 1000 modules now.

We in the BioPerl community haven't heard much about these issues from users. What (specifically) bugs you about BioPerl? What could be improved? In other words, what are its weaknesses? Specifics would be nice.

(Just for some balance, I'll blog about some of BioPerl's strengths at a later point.)

Rattling off a couple, from educated_foo and myself:

  • Make it less monolithic. Break up 'core' into more manageable bits.
  • Simplify classes.
  • Better tests?
  • Maybe a BioPerl Manual? (we do have HOWTOs that could work as starter material...)
  • Moose? Some of those interfaces fit nicely into Roles...
  • Allow parsers to return simple data structures (hash refs?) as well as objects.
  • Re-evaluate interface? educated_foo mentioned problems, but specifics would be nice.
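
To make the "simple data structures" idea concrete, here is a minimal sketch in plain Perl of what a parser returning hash refs might look like. Nothing here is BioPerl API: `parse_fasta` and the `id`/`desc`/`seq` keys are hypothetical names chosen for illustration, and the sketch assumes well-formed FASTA with no validation of sequence characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): parse FASTA from a filehandle
# into an array ref of plain hash refs with the keys id, desc, and seq.
sub parse_fasta {
    my ($fh) = @_;
    my (@records, $current);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip LF or CRLF line endings
        next if $line =~ /^\s*$/;        # skip blank lines
        if ($line =~ /^>(\S*)\s*(.*)$/) {
            # Header line: first token is the ID, the rest is the description.
            $current = { id => $1, desc => $2, seq => '' };
            push @records, $current;
        }
        elsif ($current) {
            $line =~ s/\s+//g;           # drop stray whitespace in sequence lines
            $current->{seq} .= $line;
        }
    }
    return \@records;
}

# Usage: read from an in-memory string so the example is self-contained.
my $fasta = ">seq1 first example\nACGT\nACGT\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $records = parse_fasta($fh);
close $fh;

# Print each record's ID and sequence length.
printf "%s\t%d\n", $_->{id}, length($_->{seq}) for @$records;
```

A caller who only wants IDs and sequences can work with these hashes directly, while an object-returning interface could presumably wrap the same data for users who need the richer API.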



I gave a detailed critique on reading PDB files on the previous entry. Maybe someone has a cached copy.

Over-engineered design and poor documentation were the main points. I think removing Bio::Structure from BioPerl would be the easiest solution. The loss of intended functionality would suck, but it's too poorly implemented to be useful now.

Just to try to add something constructive here... A bioinformatician considering BioPerl typically has some text, a powerful text-processor (Perl), and a need to extract some information and do a simple computation. For him to use BioPerl, it needs to offer something compelling enough to stop him from just whipping out a regex and tweaking it until it's "good enough." In some cases, e.g. with FASTA files, this can be difficult, since the file format is simple and regular.

I think doing more task-based reviews might help the rest of BioPerl. Find a biologist who isn't a developer, and see what happens when he looks into using BioPerl for a specific task. In my case, I wanted to get the secondary structure labels off the main strand of a protein. Other examples include: running PSI-BLAST on a sequence of interest and extracting the top hits (or using one of the other NCBI programs); building a multiple alignment with e.g. CLUSTAL-W or BEAGLE; finding the open reading frames in a piece of RNA and translating them to amino acids; feeding plain old text (or a home-grown file format) into BioPerl; and turning BioPerl sequence objects into strings or files in that home-grown format.
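
As a rough illustration of what the "whip out some Perl" baseline looks like for one of these tasks (finding open reading frames and translating them), here is a minimal sketch that uses no BioPerl at all. The names `find_orfs` and `translate` are made up for this example. It assumes the standard genetic code, searches only the three forward frames, starts each ORF at the first ATG it sees, and silently drops ORFs that never reach a stop codon, so it is a starting point rather than a complete solution.

```perl
use strict;
use warnings;

# Standard genetic code, built from the conventional T, C, A, G ordering.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Translate a DNA string codon by codon; unknown codons become 'X'.
sub translate {
    my ($dna) = @_;
    return join '', map { $codon_table{$_} // 'X' } $dna =~ /(...)/g;
}

# Find ORFs (ATG up to, but not including, a stop codon) in the three
# forward frames. Returns an array ref of hash refs.
sub find_orfs {
    my ($seq, $min_codons) = @_;
    $seq = uc $seq;
    $seq =~ tr/U/T/;                     # accept RNA input as well as DNA
    my @orfs;
    for my $frame (0 .. 2) {
        my $start;                       # offset of the ATG that opened the ORF
        for (my $pos = $frame; $pos + 3 <= length($seq); $pos += 3) {
            my $codon = substr($seq, $pos, 3);
            if (!defined $start) {
                $start = $pos if $codon eq 'ATG';
            }
            elsif ($codon =~ /^(?:TAA|TAG|TGA)$/) {
                my $dna = substr($seq, $start, $pos - $start);
                push @orfs, { frame => $frame, start => $start, protein => translate($dna) }
                    if length($dna) / 3 >= $min_codons;
                undef $start;            # look for the next ATG in this frame
            }
        }
    }
    return \@orfs;
}

# Usage: report every forward-strand ORF of at least two codons.
my $orfs = find_orfs('CCATGGCTTAAGG', 2);
print "frame $_->{frame}, start $_->{start}: $_->{protein}\n" for @$orfs;
```

The interesting comparison is how much documentation a newcomer has to read before the BioPerl version of this task beats a script of this size.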

It would be great to turn these recipes into a cookbook in the BioPerl distribution. I'm not doing much bio stuff at the moment, but I (and I think other BioPerl outsiders) would be happy to write up our attempts to use it on day-to-day tasks.

(Oops! I meant T-Coffee, not BEAGLE. Only T-Coffee does multiple alignments; BEAGLE is a phasing program.)

The wiki looks good, but whether or not those tasks are doable is beside the point. The relevant question is "what happens when an average biologist tries to do them?" Does he find the right solution, or does he get lost in the documentation and interface complexity and give up?

About pyrimidine

I blog about Perl (and biology).