Monday, August 25, 2008

The Joys of DIY Dynamic Programming

For nearly the first decade of my bioinformatics career I carried around a dirty little secret -- well, at least at times I felt it was one. I had coded many things, I could explain many algorithms, but I had never coded a dynamic programming alignment algorithm -- the core to so much I did. I had slightly hacked one version (just to have it do an all-all comparison of a database, doing each possible pairing only once). Finally, for a bunch of reasons, I sat down and did it -- my very own Smith-Waterman implementation.

I'm reminded of this because a couple of weeks ago I rolled back my sleeves and knocked one out again. Now, just the fact I did this reveals a bit about me. I did find at least two freely available C# implementations on-line (e.g. the C# version of JAlign) and there is a plethora of C implementations. There is also Ewan Birney's magnificent Dynamite, pretty much the catch-all for the field (Dynamite is a programming kit for doing this; in effect a programming language for dynamic programming). But, partly as a point of pride & partly because I saw I'd need to hack the one C# copy I looked at in detail, I did it. I even wrote a schmancy version -- a simple cDNA to genomic sequence aligner with two classes of gaps (one being an intron, with a really trivial model of a splice junction -- I think it used dinucleotides) All coded in Perl -- no speed demon, but it solved the problem where we needed it.

Now, it took me a good few hours to do it -- better than the few days of the first time, but not instant. I can claim that this time I didn't fall back on any study aids, such as the many online descriptions or Eddy & Durbin's & co. very well written book.

The implementation says a lot about me too. I thought of many ways to code it and finally settled on one. For example, there is the question of how to represent the alignment matrix; I used a two-dimensional array scheme (actually implemented using dictionaries -- a holdover from my Perl-centric days) but I could have also made it a graph of nodes. There is also the actual thrashing through the matrix -- the algorithm is inherently recursive, but following familial idiosyncracies I wrote the code to use loops -- well, actually I completely waffled and implemented so it can use recursion, but actually loops through! The applications I'm considering are going to be short alignments, so I didn't worry about memory efficiency (who wants to be that will bite me back!) nor did I fixate on speed (care to double the bet?) -- indeed, I wrote it to allow all sorts of baroque variations, such as different penalties for opening gaps in the two different sequences & for basic profile-to-sequence alignments. Plus it is either Smith-Waterman (local) or Needleman-Wunsch-Sellers (global), with a simple toggle.

So now the pitch: If you are a bioinformatics programmer & you haven't written one, I urge you to do it. It's great practice & nothing illustrates an algorithm like trying to implement it. If you don't consider yourself a programmer, guess what? It's perhaps not the obviously easy first start, but just thinking about it will stretch your mind. Plus, you get a free bioinformatics Rorschach test from your implementation choices!

One last thought: who can think up (and execute) the most comically baroque -- but functional -- implementation of S-W/NWS? Has it already been done in PostScript? How about in a relational database (I've written some pretty baroque SQL this year, but I doubt I could tackle this)? S-W as an Excel spreadsheet? Coded with glider guns? A full description for a true Turing machine? Of course, the grand prize winner would clearly either be to build a DNA computer to compute an alignment -- but perhaps that could even be topped by implementing the algorithm with living cells as the alignment cells!

6 comments:

Randy said...

I understood most of what you posted, and I hate to be the one to ask the "dumb" questions, but I'm confused by a couple of things. (I'm not a programmer or knowledgeable about bioinformatics, so these are honest questions):

Since alignment algorithms already exist, what problem were you solving with your own algorithm?

With your algorithm is there a difference in speed, or a difference in the type of results returned?

Do you see any trends emerging from alignment algorithms that would lead to an understanding of gene transcription and regulation?

I appreciate your blog, thank you for another informative post, Randy

Keith Robison said...

It's not at all a dumb question.

Dynamic programming is a general strategy for optimization -- it is guaranteed to give you the optimal solution given the parameters & statement of the problem (the Wikipedia entry is a good place to get an overview).

Computing the best alignment between two sequences is a problem which can be solved by DP, and indeed has been many times. However, the statement of the problem here can have many subtleties, and that was partly why I wrote my own.

For example, in the most common simple formulation you have a score for a match, a score for a mismatch, a penalty to open a gap and a penalty to extend a gap.

An obvious & common extension of this is to have different mismatch & match penalties depending on what two characters are being compared (substitution matrices). For example, in a protein alignment you will probably penalize hydrophobic swaps (e.g. Leu for Val) much less so than hydrophobic for charge (e.g. Leu for Arg).

My particular implementation should support the further extension that each amino acid position has its own substitution matrix (a sequence profile) & own gap penalties.

It would also support different classes of gaps -- e.g. inserting an intron vs. inserting a small insertion or deletion. Birney's Wise programs even had as many as 3 gap classes (small nucleotide indel, insertion/deletion of entire codons, introns). Both his Dynamite & Wise papers are good reading -- as is the Eddy et al book I cited in the text.

I won't claim to be an expert on the trends, but there are still some very active areas. Multiple sequence alignment by DP is intractable for more than 3 sequences (I'm pretty sure that's the current limit), so approximate methods must be used and these are still evolving (previous post on the subject).

Another still evolving field is the problem of spotting conserved regulatory elements near genes -- since these may not always be in the same order.

Another example of recent alignment flavor is the recent Nature report of a discontinuous hammerhead ribozyme in a mammalian gene.

Cheers!

Randy said...

Thank you! Food for thought and some directed reading that will keep me busy for a while, I appreciate it! Randy

Aaron J. Mackey said...

For PostScript, I'm afraid it's already been done:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/9/4/421

Keven said...

Agreed -- should be "required reading" for all the bioinformatically-minded out there. DP-ing for myself was somewhat personally gratifying. A bit akin to the "thrill of the hunt" probably. Not that I'm the hunter type, mind you. Perhaps that's why I am replying to your blog instead of stalking Odocoileus. BTW, keep up the great blog.

Ian Holmes said...

Nice article. For anyone seeking a generic DP engine in the mould of Dynamite, these days I would recommend Gerton Lunter's "HMMoC" (Hidden Markov Model Compiler).

FWIW, I wrote the technical appendix to the the Dynamite documentation that you posted, and have coded up a number of bioinformatics automaton compilers & interpreters (including "phylocomposer" which is a transducer compiler)

Cheers
Ian