Monday, November 16, 2015

Do Demons Dream of Phylogeny Packages?

Miserable day today - spent my entire day wrestling with bad formats and flaky tools and trying to bull my way past them, leading to many a mad expostulation. The whole day down in the pit, with the pendulum of multiple deadlines swinging just over my head. The MBTA released new schedules that muck with my routines. And then, to top it off, Mick Watson writes a piece titled "The Five Habits of Bad Bioinformaticians" that cuts far too close to home. So I arrived home in a foul mood, my senses unpleasantly heightened to every sound.
Mick's piece is to be borne as best I can, but if he writes 999 with such bad timing I may invite him to toast his health with some fine sherry. In the abstract, I can agree with all of them -- there is almost always a tool already built to do what you want and you should always research and use the best tool out there.  But, doing so under time pressure isn't easy, especially when you feel that a lack of attention to multiple projects will usher their houses to fall.

For example, since I work at a company, the first cut for any software is whether it is openly licensed. Sure, in theory I could get a license drawn up, but by the time that happens I've long since moved past. Even if university technology licensing offices worked quickly, they often have exaggerated opinions of a software's worth. This was brought starkly home in my first paid position, when I tried to license some software which I knew was useful but full of bugs. Technically I had a conflict of interest, but the reality is that even at the exorbitant price Harvard wanted for my code, I wasn't going to see a dime. Worse, one must often deal with the delay of licensing simply to evaluate code.

Now, I do try to be a good citizen and never use code that doesn't have an explicit license allowing me to do so.  Sometimes that's hard to find -- it's tres annoying to download a tar.gz file and unpack it and dig through a few directories and then find an exclusionary license. 

Let's go back to today's nightmare.  I'm in the middle of trying to generate some pretty phylogenetic trees marked up based on metadata for the sequences and with confidence information on the tree topology.  Doing this well often involves a cycle of aligning the data, marking up the tree and then discovering some glitch in the input data or the metadata. 

Since this is what a lot of folks do, there should be great tools out there, right?  Perhaps lying around in plain sight?  Perhaps, but that's not my experience.

First, there's a plethora of programs for each stage of the process.  Multiple aligners for protein?  Well, there's Clustal Omega, MUSCLE, MAFFT and probably a few dozen more.  Each offers a different array of possible alignment outputs.  Then a wealth of tree generation programs, with again a raft of formats.

Phylogenetic formats -- I feel immured by them. There's Newick, named after a New Hampshire seafood restaurant (which is why it is sometimes called New Hampshire format). There's an extended version of Newick. There's Nexus format. Two different XML standards: PhyloXML and NeXML. The venerable PHYLIP format. And that's the tip of the iceberg.
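
For anyone who hasn't stared at one, here is roughly what plain Newick looks like, with a minimal sketch of pulling labels out of it. This is a regex toy that assumes unquoted labels and no bracketed comments, not a real parser; the sequence names and the function names are made up for illustration. The label after a closing parenthesis is an internal-node label, which some programs use to carry a support value.

```python
import re

# A label is any run of characters that is not Newick punctuation or whitespace.
# Assumption: labels are unquoted and the tree has no [bracketed comments].
LABEL = r"[^(),:;\s]+"

def leaf_names(newick):
    """Return labels that directly follow '(' or ',', i.e. the leaves."""
    return re.findall(r"[(,]\s*(" + LABEL + ")", newick)

def internal_labels(newick):
    """Return labels that directly follow ')', i.e. internal-node labels."""
    return re.findall(r"\)\s*(" + LABEL + ")", newick)

# Hypothetical three-leaf tree; numbers after ':' are branch lengths.
tree = "((seqA:0.1,seqB:0.2)0.95:0.3,seqC:0.4);"

print(leaf_names(tree))
print(internal_labels(tree))
```

Anything with quoted names or embedded comments will defeat this, which is rather the point of the complaint.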

I've seen things you people would not believe. The first problem, beyond the sheer cacophony of different formats, is that different programs support different ones -- and often badly. For example, the MrBayes software for estimating tree confidence (for large trees, in geologic time, unless it crashes) will write in Nexus format -- and then refuse to read its own output! Perl's Bio::TreeIO happily generates XML files that many other programs won't read, complaining about tags that don't belong -- somebody is just plain wrong here! Ditto the various tree viewers / editors that refused to consume the XML generated by upstream programs. And at least one of these packages insists that everything after the angle bracket in a FASTA file is part of the unique identifier, which it then has the temerity to complain contains spaces!
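
The usual workaround for the FASTA complaint is to strip each header down to its first whitespace-delimited token before handing the file to the fussy program. A minimal sketch, assuming the common convention that the identifier ends at the first whitespace; the function names here are my own, not from any library:

```python
def split_fasta_header(header_line):
    """Split a FASTA header line into (identifier, description).

    Assumes the identifier is the first whitespace-delimited token
    after the leading '>'.
    """
    text = header_line[1:].strip() if header_line.startswith(">") else header_line.strip()
    parts = text.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description

def simplify_fasta(lines):
    """Yield FASTA lines with each header reduced to its bare identifier."""
    for line in lines:
        if line.startswith(">"):
            identifier, _ = split_fasta_header(line)
            yield ">" + identifier
        else:
            yield line.rstrip("\n")

# Hypothetical record for illustration.
example = [">seq1 some protein [Some organism]\n", "MKTAYIAKQR\n"]
for out_line in simplify_fasta(example):
    print(out_line)
```

This only helps if the truncated identifiers remain unique, so it is worth checking for collisions before trusting the output.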

Yeah, that confession -- I'm still using Perl. Perhaps that falls under Mick's category of sticking with obsolete tools. The problem is the grooves seem worn too deeply into the brain of this canis venerablis, who also resists the obvious slide over to Python in favor of dabbling in not-quite-ready-for-production bioinformatics Julia (or Scala). But it is clear that the PhyloXML support is limited in Perl, with only a small amount of the expressivity encapsulated. Could be worse: I had to fix typos in the PDB module last year. This is to me one of the signs that Perl is decaying as a bioinformatics language: the core sequence format modules were built well (not that I always love their design), but get out into other important areas of bioinformatics and the modules are both limited and flaky.

Actually, today I thought I'd try some Python -- perhaps a good Python library for phylogenetics could get me started.  I've done a little bit of Python off-and-on.  Well, first I had to fix a typo in the library (arrrgh!), and then it wouldn't read the Newick output from FastTree.  Aiiiiyeee.  A few more rounds of that and I'll be indistinguishable from a murderous orangutan.  On the way home I thought of a new tack -- there's always R & an R package for phylogenetics that crossed my Twitter stream recently.  Salvation, or new madness? -- I dread to risk it.

The expressivity of some of these formats leaves a lot to be desired, though I can't quite slam PhyloXML as I'm still trying to digest it.  But certainly the Perl libraries don't have any higher order notion of annotating a tree.  Ideally I'd be able to define formats by name and assign them, but if you're running a procedure three times to individually set the RGB color codes for a node or edge, you're not in the world of higher abstractions.
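
What I have in mind by "define formats by name and assign them" is something like the sketch below, written in Python with only the standard library. The style names and helper functions are hypothetical, and the element names (clade, name, color with red / green / blue children) follow my reading of the PhyloXML schema, so check them against the schema before relying on this; the PhyloXML namespace and the enclosing phylogeny element are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical named styles: style name -> (red, green, blue).
STYLES = {
    "outgroup": (128, 128, 128),
    "clade_of_interest": (220, 30, 30),
}

def color_element(style_name, styles=STYLES):
    """Build a <color> element from a named style in one call."""
    red, green, blue = styles[style_name]
    color = ET.Element("color")
    for tag, value in (("red", red), ("green", green), ("blue", blue)):
        ET.SubElement(color, tag).text = str(value)
    return color

def styled_clade(name, style_name):
    """Build a <clade> carrying a name and the color for the given style."""
    clade = ET.Element("clade")
    ET.SubElement(clade, "name").text = name
    clade.append(color_element(style_name))
    return clade

# Hypothetical leaf name for illustration.
xml_text = ET.tostring(styled_clade("seqA", "outgroup"), encoding="unicode")
print(xml_text)
```

The point is only that the RGB triple is looked up once by name, so changing a style means editing one dictionary entry rather than three calls per node.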

So, given all these problems, the temptation to roll my own XML-based parser proved too strong, and I had the wild audacity to violate another of Mick's guidelines -- though this led not to a perfect triumph. But since most of what I wanted to deal with isn't really XMLified, but rather exists as its own syntax within text blobs, that perhaps didn't gain me much. And perhaps I'll confess also to just trying to directly insert some of the markup code in the XML without a proper parser.

Can I escape these multiple things of evil which flock to me, or will flaky and ill-behaved phylogenetics software continue to throw a shadow on my floor?  If I don't finish this and move on to something different, then my soul from out that shadow shall be lifted...

5 comments: said...

Maybe you should use one of those treefinder alternatives.

Keith Robison said...

Yes, being a good net citizen means using software only when licensed and not using when proscribed, no matter how morally repugnant the licensor is -- I'm glad I had never started using TreeFinder, but I would not have hesitated to stop using it if I had.

DanU said...

Sorry, Keith. This is super-painful to read when I suspect that you're doing all this to update or extend stuff that I did a while back. You've hit all the same problems as I did more than a year ago, and it's sad that things haven't changed at all! We know how some people love phylogenetic trees, but all the modern tools are in clustering algorithms and their display, maybe it's time to force a paradigm shift at the company...? Heh.

Heng Li said...

Well, it sounds that XML is the root of all evils.

Keith Robison said...

I'm actually a fan of XML, in that it eliminates entire classes of parsing errors and in particular since a given document can be validated. That said, XML (and HDF5 and ASN.1 and YAML and...) doesn't fix the problem of semantic content, but that's a whole stalled post...