For nearly the first decade of my bioinformatics career I carried around a dirty little secret -- well, at least at times I felt it was one. I had coded many things, I could explain many algorithms, but I had never coded a dynamic programming alignment algorithm -- the core to so much I did. I had slightly hacked one version (just to have it do an all-all comparison of a database, doing each possible pairing only once). Finally, for a bunch of reasons, I sat down and did it -- my very own Smith-Waterman implementation.
I'm reminded of this because a couple of weeks ago I rolled back my sleeves and knocked one out again. Now, just the fact I did this reveals a bit about me. I did find at least two freely available C# implementations on-line (e.g. the C# version of JAlign) and there is a plethora of C implementations. There is also Ewan Birney's magnificent Dynamite, pretty much the catch-all for the field (Dynamite is a programming kit for doing this; in effect a programming language for dynamic programming). But, partly as a point of pride & partly because I saw I'd need to hack the one C# copy I looked at in detail, I did it. I even wrote a schmancy version -- a simple cDNA to genomic sequence aligner with two classes of gaps (one being an intron, with a really trivial model of a splice junction -- I think it used dinucleotides) All coded in Perl -- no speed demon, but it solved the problem where we needed it.
Now, it took me a good few hours to do it -- better than the few days of the first time, but not instant. I can claim that this time I didn't fall back on any study aids, such as the many online descriptions or Eddy & Durbin's & co. very well written book.
The implementation says a lot about me too. I thought of many ways to code it and finally settled on one. For example, there is the question of how to represent the alignment matrix; I used a two-dimensional array scheme (actually implemented using dictionaries -- a holdover from my Perl-centric days) but I could have also made it a graph of nodes. There is also the actual thrashing through the matrix -- the algorithm is inherently recursive, but following familial idiosyncracies I wrote the code to use loops -- well, actually I completely waffled and implemented so it can use recursion, but actually loops through! The applications I'm considering are going to be short alignments, so I didn't worry about memory efficiency (who wants to be that will bite me back!) nor did I fixate on speed (care to double the bet?) -- indeed, I wrote it to allow all sorts of baroque variations, such as different penalties for opening gaps in the two different sequences & for basic profile-to-sequence alignments. Plus it is either Smith-Waterman (local) or Needleman-Wunsch-Sellers (global), with a simple toggle.
So now the pitch: If you are a bioinformatics programmer & you haven't written one, I urge you to do it. It's great practice & nothing illustrates an algorithm like trying to implement it. If you don't consider yourself a programmer, guess what? It's perhaps not the obviously easy first start, but just thinking about it will stretch your mind. Plus, you get a free bioinformatics Rorschach test from your implementation choices!
One last thought: who can think up (and execute) the most comically baroque -- but functional -- implementation of S-W/NWS? Has it already been done in PostScript? How about in a relational database (I've written some pretty baroque SQL this year, but I doubt I could tackle this)? S-W as an Excel spreadsheet? Coded with glider guns? A full description for a true Turing machine? Of course, the grand prize winner would clearly either be to build a DNA computer to compute an alignment -- but perhaps that could even be topped by implementing the algorithm with living cells as the alignment cells!