Monday, June 30, 2014

The good, bad & missing from Bio* libraries?

As I mentioned recently, I've been exploring how I might use the emerging Julia language to solve problems.  While that requires a large amount of mental work, I see some potential gains, both in having more readable code than Perl as well as to potentially leverage a lot of high-level concepts for parallel execution that are built into the language.  But beyond the challenge of elderly canine pedagogy that I present, there is the issue that the BioJulia library is quite embryonic, with serious consideration of treating much of the existing code base as a first draft (or, that is the impression I get from skimming the Google group).  So I'm going to try to pitch in, despite my multiple handicaps.
But since the libraries are embryonic, perhaps there is a chance to really do some things right. Of the public libraries, I've mostly used BioPerl and a bit of BioConductor (R), but I've also at times designed my own libraries in C++, Perl and C# as well as some other languages.

Library design is not a task that should be taken lightly, as there are many aspects to consider and conflicts between them are inevitable.  I feel that foremost a library should be in the style and use the common idioms of the core libraries of a language and should take advantage of the strengths and facilities of that language.  So BioJulia should not, in this view, follow Professor Lehrer's advice ("Why do you think God made your eyes? Plagiarize, plagiarize, plagiarize!") and just port BioJava or BioRuby or BioPerl or BioPython over (I'm happy to see there still isn't a BioCobol).  Any new library effort should learn from predecessors without parroting them.  A little looking at some of the lesser-used languages in bioinformatics reveals a number of interesting bits in them. For example, SCABIO (Scala) has generic dynamic programming as part of the library.

With Julia, one possibility would be to identify some good C or Python code and import it -- the language is intended to make this easy.  Unfortunately, this doesn't extend to C++, which seems to be the language for most packages I'm interested in.  Even if a library is imported en masse, there still may be room to write higher-level wrappers on it. It doesn't matter that few groups are using Clojure or Haskell; what matters is that the groups that have been using them have been experimenting. 

Good class design, and good design of libraries of such classes, should make them easier to learn. Adopting popular patterns from other libraries is one way to do this, but also just doing things consistently is important.  For example, if your class looks like a collection it should act like one: have iterators and such and use common method names.  Some design gets tricky, and certainly can be a matter of personal taste.  In particular, are many layers of classes good encapsulation or an annoying set of Russian dolls?

Libraries also need to be designed with performance in mind, though sometimes performance and other attributes must be traded off.  Figuring out appropriate boundaries of libraries and classes can also be tricky. Installation must be clean on many platforms, even if some underlying C code is needed.

So, I throw this out to the audience: what do you like about existing libraries?  What do you dislike?  What are the most glaring missing pieces?  It would be really amazing to hear from folks who have participated in designing or are really familiar with the different design routes taken by different libraries.  I'll stick to BioPerl, as that is what I interact with most & seem to remember my gripes, though I'm sure I've just acclimated to some.  

Just to seed the conversation, a few of my gripes with BioPerl.  Now, please don't take this the wrong way -- I appreciate deeply how much time and effort went into these and that no designer is omniscient nor is code necessarily used in the way it was initially intended.  I've tried to erase the ones where a little bit of digging proved that the "missing" feature was missing only in my mind, but it wouldn't shock me if some remain.

One module I've definitely wrestled with is the Samtools module for Perl; installation has misfired more than once for me, and I've also found the performance to be quite slow.  So for many things, I've fallen back on just parsing SAM directly, which has it's flaws. This also underscores the need for libraries to be easily installable, which can be a challenge if they rely on C or other libraries or have complex chains of dependencies.

CIGAR support: CIGAR is a useful way to represent alignments compactly, but I haven't found any BioPerl that really allows you to operate with these -- e.g. given a CIGAR, pad out a sequence.  Reverse a CIGAR.  Create a CIGAR from an alignment.

Bio::SearchIO.  This is the module that deals with BLAST and other searches. One thing I dislike is an insufficient (IMHO) degree of class definition / too many quoted keywords to do important things. If you want the start of an alignment, it's $obj->start("query") or $obj->start("hit") and there's parallel methods for the end and the strand and the sequence line etc. I'd rather have objects that hold this information that then have start, end, strand etc methods. 

I've generally had problems with the alignment objects in Perl -- the scheme has never quite worked for me.  I need to revisit them just to figure out what my complaint has been, but I don't actually use them.  Again, some of that is more a problem between my ears than with the library, but perhaps there are useful rethinks there.

Bio::Seq has some flaws that trip me up.  In particular, subseq returns a string rather than another Bio::Seq object.  So if I want to perform further sequence operations on subseq, I need to construct new objects. Languages such as Julia may make this distinction a bit less troublesome -- it's typical to provide a conversion method to String.

The Primer3 interface in BioPerl is a good example of design choices I wouldn't make, though perhaps others like them.  The Primer3 objects encapsulate almost nothing; you interact with Primer3 by using all its keywords.  This is infinitely flexible and adaptable to changes in Primer3, but it also means all that Primer3-specific stuff is embedded in your code.  I always end up writing a higher-level interface to Primer3; why not just have it built into the language?

Counting Bloom filters.  Julia has a Bloom filter library, but it doesn't do counting.  For those who haven't bumped into them, Bloom filters are a probabilistic hashing scheme that handles huge sparse data sets better than standard hashing techniques.  The catch is that Bloom filters have a chance of false positives: if you load a Bloom filter with elements and then ask whether a given element is present, it may sometimes falsely report something to be present that it has never seen (on the other hand, there is a guarantee of no false negatives).  Counting Bloom filters count the number of occurrences, and show up a lot with kmer analysis of large sequencing datasets.

Dynamic programming -- as mentioned above, the Scala biology library has this.  I've started sketching out a simple Smith-Waterman in Julia, as writing S-W is a good programming exercise in any new language, but SCABIO goes far beyond that basic. Even if it isn't suitable for the biggest jobs, a good abstracted DP engine could be useful for a lot of problems.

Lab stuff.  Something I haven't seen much of in Bio libraries (though perhaps I just haven't looked hard enough) are classes for interfacing with the real world.  One sort of library I end up writing, and have started sketching out for Julia, is for handing 96 & 384-well plates (and sufficiently abstracted to handle other geometries). This includes working in the native coordinate system (A1, B2) -- or with leading zeros (A01, A02), iterating across plates in logical ways, etc.  Libraries to control standard robots would be good -- or perhaps some higher level abstractions to mate to either commercial devices or Arduino ones.


Scholander said...

You probably already know, but in Bio::Seq you can use trunc() instead of subseq() to return a sequence object...

Totally agree about BioPerl's handling of alignment objects. I'm about to have to use it again for a project and I'm dreading it.

Personally, I'd like to see more robust handling of the GenBank format. It's stupid and archaic and loosely structured at best, but until we get our coworkers away from commercial packages, it's the rich sequence format we're stuck with.

Rick said...

It would be nice to support gzip format without the need to specify it. In BioPerl they suggest using:

-file => 'gunzip -c |'

But that is awkward and prone to breakage if your gunzip is not in your path. I know that it is possible to write programs that examine the gzip'ness of a file and deal with it automatically.

Geneticist from the East said...

I think BioPerl has decent Dynamic Programming module called Bio::Tools:dpAlign. It implements Miller-Myers algorithm which is analytically equivalent to Smith-Waterman but runs faster and consumes only linear memory space. Maybe you can check it out and port it to Julia?