Friday, January 27, 2017

Perl: The Bad Habit I Can't Quite Kick

TULIP is a new assembler for long, error-rich reads such as those from nanopore sequencing. I was a bit stunned to see that TULIP is written in Perl; I was starting to wonder how many holdouts like me there were. Which led to an exchange on Twitter.

The Problem with Perl

Well, to me Perl is a bit broken.  There's always been the sheer catastrophic mess of its design, or lack thereof.  My oldest brother, whose opinions I value highly, simply groans that Perl is "pure bad taste".  Of course, taste is, well, a matter of taste, but Perl just doesn't have a well-designed feel to it.  

What's serious, though, is the rather frequent discovery that Perl libraries are simply outdated, broken or non-existent. For example, I switched over to Python for the head end of my Twitter extractor, as the Perl library wouldn't work for me. If I remember correctly, it was a security model or so behind.

Similarly, back in summer 2014 I needed a tool to extract FASTA data from MinION FAST5 files. No tool yet existed within the community. But again a library failure; my memory is foggy now, but I think the Perl HDF5 library couldn't handle some key features. Either that or it crashed. So I tested out Julia instead, which worked, and I released the first tool for the task within the nanopore community. I later let that lapse; Nick Loman's Poretools and Mick Watson's poRe are both excellent and I let them manage the headache of supporting an evolving format.

There's more. I tried out the PDB library back in 2014; a key function required a trivial fix. Trivial, but still non-functional as delivered. There was some library I looked into a week or so ago that was just a pilot, but not updated since 2011. Googling "FASTG Perl" appears to be a dry hole. Etc. It's just too common and too frustrating to discover that newer formats aren't available in Perl. Yes, I could be the one to write them, but those sorts of initiatives usually get dragged down by "gotta get work done now" issues. Much easier to find workarounds, even when they are ugly. I could have done that with FAST5; as someone pointed out on Twitter when I griped about this, I could have just piped the output of the command-line HDF5 dumper and read that into Perl. But those sorts of solutions start getting frustrating and slow, though I use them frequently.
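For what it's worth, the pipe-the-dumper workaround looks about the same in any scripting language. Here is a minimal sketch in Python rather than Perl, assuming nothing beyond the standard library. The helper name `stream_command_lines` is made up for illustration, and the `h5dump` invocation in the comment is an assumption about how one might call the dumper, not a tested recipe; a small stand-in command keeps the sketch runnable anywhere.

```python
import subprocess
import sys


def stream_command_lines(command):
    """Run `command` (a list of arguments) and yield its stdout line by line."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the `with` block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)


# Stand-in for a real dumper so the sketch runs anywhere. In practice the
# command might be something like ["h5dump", "-d", DATASET_PATH, "reads.fast5"],
# where the dataset path and file name are hypothetical.
demo_command = [sys.executable, "-c", "print('line one'); print('line two')"]
lines = list(stream_command_lines(demo_command))
```

The ugly part is not the pipe itself but everything after it: the script still has to parse whatever text layout the dumper emits, which is where the "frustrating and slow" comes in.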

Then there's the incomplete stuff that I'm just not up to working around. Actors are an interesting way to manage multithreading; design your objects correctly and just throw them from actor to actor. Still not simple, but better than dealing with mutexes and semaphores and such. I played with them in Scala back at Infinity. There is a Perl lightweight threading library to implement actors, but it comes with warnings that it isn't a final version but a research project. The last copyright date is 2011, suggesting that no further versions are pending. That doesn't breed confidence.
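To make the actor idea concrete, here is a toy sketch in Python using only the standard library's `threading` and `queue` modules. It is not the Scala actor API and not the Perl library mentioned above; the `Actor` class, its methods and the reverse-complement handler are all invented for illustration. Each actor owns one thread and one mailbox, and messages are simply handed from one actor to the next.

```python
import queue
import threading


class Actor:
    """A tiny actor: one thread, one mailbox, messages handled one at a time."""

    _STOP = object()  # private sentinel that tells the thread to exit

    def __init__(self, handler, downstream=None):
        self._handler = handler
        self._downstream = downstream
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Drop a message in the mailbox; never blocks the sender."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish its queued messages, then wait for it."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            result = self._handler(message)
            if self._downstream is not None:
                self._downstream.send(result)


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


collected = []
collector = Actor(collected.append)
revcomp = Actor(reverse_complement, downstream=collector)

for read in ["AACG", "GGTT"]:
    revcomp.send(read)

# Stop upstream first so everything it produced reaches the collector.
revcomp.stop()
collector.stop()
```

The queue does the locking internally, so the handlers never touch a mutex. The catch, as noted, is that the objects being thrown around need to be designed so no two actors mutate the same one.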

I almost vowed to bail on Perl when I joined Starfleet, but the press of getting stuff done quickly wore down my resolve.  That was 2011 and Perl's stagnation wasn't quite so clear. A decision I regret deeply, as now I have a ton of working code I rely on that I sure am not going to port to another language, so any switch wouldn't be going cold turkey. Plus, that's not even remotely realistic; there's always those handy one-liners.

There's also the problem that the language is burned into my brain; I've been using it for two decades now.  If someone presents a problem, my mind starts spinning code.  When programs go south, I often have intuition about what went wrong. I can read between the lines on error messages, often quickly getting to the problem.

A skeptic could quickly knock down several of those.  I've gotten fluent in other languages, sometimes very quickly.  I learned C# in no time at Codon.  Better designed languages would set fewer traps for me, meaning fewer bugs and fewer error messages.  I may be getting a little gray around the snout, but I'd like to think I can still learn some new tricks. 

Programming language discussions are apt to melt down into pointless flaming. I guess since I'm dissing Perl and airing my complaints about some others, that might be expected, but I hope any comments will tend more towards light than heat. Hope springs eternal, right?

Perl6 - Bah!

But what to switch to? Don't even ponder contemplating thinking about suggesting Perl6. First, I've never heard of anyone in bioinformatics using it. Second, all reports I've seen say that it is an unholy mess, trying to do everything for everyone while staying somehow backwards compatible. Most importantly, it wouldn't solve the library problem.

The worst part of Perl6 is that it sucked away a lot of developer talent from Perl5.  I suspect Perl6's tortured, seemingly unending gestation also drove a lot of developers away, or scared off new recruits.

PHP - Abandon All Hope Ye Who Program Here

Speaking of awful language design, there is a deep pit in hell reserved for PHP.  Down where Satan's wings have adiabatically frozen the water. I had to look through an ex-colleague's PHP code recently, and it simply reinforced my belief that the world would have been better off with some better Perl libraries.  

C# - Riiiiiiight

I really, really liked C#. Great language. But academia gives it the leper treatment, probably since it came from Microsoft. Library problem again; there is a .NET Bio library, but skimming it for 10 seconds doesn't give me encouragement.

Python - Resisting the Obvious

Python is the obvious choice, and perhaps I resist it because it's too obvious. I've dabbled around in Python, first at Codon Devices, where one developer used it exclusively. Codon was a real menagerie, with one segment of code in Python, another whole swath in C# and then some of my stuff in Perl and a touch of R (also me). They all talked to each other mostly through a relational database, though a few other mechanisms existed (CORBA! We actually used CORBA! I think -- seems too bizarre to be real). Oh, and also via IronPython, a C# to Python interface that I played with. My talented intern at Infinity used Python, and I could read it just fine. At Starbase I've tweaked a couple of outside packages, but that Twitter grabber is my main bit of Python and it's tiny.

I think I've gotten over my original negative reaction to the "indents matter!" business, though it still frets me a bit since I so often end up refactoring code by adding loops or removing loops or splitting loops across procedures or other things which would seem to change the indentation level.  Perhaps a good editor helps with that, though to be honest with any new editor I tend to learn the bare minimum (the pinnacle of that is vi/vim -- I know how to get out of it!).

R - Why Can't I Love It?

I've periodically dived into R, even used R graphics in a book chapter written while I was at Millennium.  Dabbled a bit at Codon, Infinity and now with the Federation.  Sometimes that's just cookbook analysis of expression data, but I have written a bit of code.

Somehow it just never catches fire. I haven't quite caught the Zen of R or something. Dataframes and I aren't yet best buds. I can't really describe it properly -- or perhaps I'm too embarrassed that my problems might not be very rational -- but R just hasn't (yet?) fit me.

Julia: Not Ready for Bioinformatics Prime Time?

R and Python have huge bioinformatics communities, which argues strongly for them. Julia is a potential up-and-comer. Lots of interest in scientific computing fields, that important sibling endorsement (Arch was on the JuliaCon organizing committee for several years), interesting possibilities ahead of it with many serious developers.

When I looked at BioJulia a few years ago, though, the library complement was an issue. Still, there was an energetic bunch of folks participating in the development message boards. I've looked again and progress has been made, but (again, from a very fast scan) it appears a lot of libraries are missing from BioJulia, such as parsing GenBank. Can't blame them too much for avoiding writing a GenBank parser, an exercise in misery from trying to anticipate all the deviations from an ill-specified standard. Now in theory Julia can be made to use Python libraries, though I haven't figured that out. But I don't really want to start with one library set and then switch later on. That could be camouflaged a bit by writing a bunch of wrapper libraries, but that's still some medicine I'd like to avoid swallowing.

What Else?

I dabbled in Scala back at Infinity, which was interesting and I did get some actors working (as noted above), but again it's not caught fire for bioinformatics. Let's face it, nearly every language has been tried in bioinformatics and somewhere there is a BioX library for nearly every language X. Except Cobol, thankfully. My father brought me up to hate three things in life: BASIC, Cobol and Fortran, though he relented when BASIC was the only language available on some early PCs. But few of these efforts have much there there. I could see the personal growth advantages of trying out Haskell. I've never learned a LISP-family language; that's a hole in my development perhaps I should fix. Maybe I should play with some newfangled internet sensation like Go. I did Java twenty years ago; the code just looks ugly to me now. Too much work to do very little things. I had no flair for C or C++; managing my own memory is a headache I have no interest in.


Thus does analysis paralysis set in.  There's always some exciting science to read up on.  There's always science I need to read up on for my job.  Ideally those two overlap, but I won't claim perfection there.  Databases to curate, meetings to attend.  Blogs to write.  Primers to design.  Experiments to design, experiments to analyze.  All the fun chaos that is my bioinformatics life.  So I'll slog it out a while longer with Perl and find those workarounds for missing libraries.

Or maybe I'll switch to the runner-up in some programming language survey for last year, a striver that popped +0.91% in their arbitrary measure.  Yup, the clear up-and-comer is a Practical, Eclectic Rubbish Lister.


Anonymous said...

I used to be a C++ programmer (from the very first version of C++ at Bell Labs), but I gave up on it a few years ago—old programs would no longer compile and the maintenance work to keep up with the constantly changing language was too much.

I try to program in Python in a way that will work in both Python 2.7 and Python 3 (using the frozen "from __future__ import …" statements), but I suggest Python 2.7 as the appropriate language to program in; it is frozen now, so things will continue to work (or not work) the same way no matter what brainstorm the language developers have in Python 3.
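As an illustration of that style, here is a small sketch of a module meant to behave the same under Python 2.7 and Python 3. The three names imported from `__future__` are one plausible choice, not necessarily the set the commenter uses, and `gc_fraction` is a made-up example function.

```python
from __future__ import division, print_function, unicode_literals


def gc_fraction(sequence):
    """Return the fraction of G and C bases in a sequence."""
    gc_count = sum(1 for base in sequence.upper() if base in "GC")
    # With `division` imported, `/` is true division under both interpreters.
    return gc_count / len(sequence)


fraction = gc_fraction("ACGT")
# With `print_function` imported, print is a function under both interpreters.
print("GC fraction:", fraction)
```

Under Python 3 these imports are accepted but change nothing; under 2.7 they switch on the Python 3 behavior for division, printing and string literals.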

I find Python an excellent tool for rapid prototyping (which is what most research code is) and for simple user interfaces. The numpy and scipy packages make it suitable for medium-scale scientific programming, and Cython can recover a good chunk of the interpreter overhead. It is not (currently) a good programming language for GUIs, as the GUI packages are all incompatible and mostly arcane.

I see no reason to write a new program (even a one-liner) in PERL. You may be stuck with legacy code in PERL, but there is no excuse for creating more of it.

Jonathan Badger said...

In terms of scripting languages, there is also Ruby, with its BioRuby. Many people think Ruby is just a web-thing, but it is a general purpose scripting language that just seems cleaner and more consistent than Perl or Python (and doesn't have the silly white-space dependency of the latter).

John Didion said...

I switched from industry to academia in 2008 as a Java programmer, and it was clear that wasn't going to fly in bioinformatics. So I quickly taught myself Perl, which at the time seemed to be synonymous with bioinformatics. I had to hold my nose, but when in Rome...

A few years later I looked at the landscape again and fortunately Python had come a long way. Additionally, I realized I needed to learn R for statistical analysis and creating publication-worthy figures. I've been very happy with the [data processing in Python] --> [data analysis and visualization in R] stack ever since. Python 3 is a huge upgrade in terms of performance and just the overall quality and consistency of the language and libraries. I've written code in Py3 (with Cython extensions) that performs similarly to or better than comparable tools written in C++. The breaking changes from py2 to py3 are frustrating but I think necessary for the long-term health of the language.

On the horizon, I think Julia is definitely going to be my next language. It will reach maturity in 2-3 years, and have solid library support within 5 years, such that it can replace both python and R in my stack. The other language I'm very interested in is Rust, specifically as a replacement for C++/Cython for optimizing the slowest bits of my python code. Also, I think Javascript will grow in importance for interactive visualization, but for that purpose it will mostly be accessible via python bindings (e.g. Bokeh).

Lucas van Dijk said...

Please, stop recommending to start new projects in Python 2. Python 3 is a better language, with a more organised standard library, which is more iterator based (memory efficient) and with very nice new features such as async functions and type hints. There are hardly any libraries without support for Python 3, and more and more libraries will drop support for Python 2 in the coming years. Python 3 was released 8 years ago. It's time to make the switch (and it is worth it).
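A short sketch of two of those points, the lazy built-ins and type hints, assuming Python 3 and only the standard library. The function `read_lengths` and the sample reads are invented for illustration.

```python
from typing import Iterable, Iterator


def read_lengths(reads: Iterable[str]) -> Iterator[int]:
    """Lazily yield the length of each read; nothing is computed until consumed."""
    # In Python 3, map returns an iterator rather than building a full list.
    return map(len, reads)


lengths = read_lengths(["ACGT", "AC", "ACGTAC"])
is_lazy = not isinstance(lengths, list)
total = sum(lengths)  # consuming the iterator does the actual work
```

The hints are not enforced at run time; they document intent and let external checkers catch mismatches.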

Unknown said...

First, I must agree with Lucas. Python3, please.

As someone who knew a bit of Python and R, I would very strongly urge you to at least try Go. It's almost as expressive as Python, but it is simple and very readable. Being statically-typed and compiled actually helps noobs. The standard library and community are great. Plus, once you get into more modern features such as concurrency and easy cross-compiling, you'll be hooked.

Unknown said...

What's wrong with Java? Now particularly with Java 8.

Kyle Lesack said...

I have to agree with Lucas about Python 3. In addition to his points, I'd add Unicode handling as well. Python 2 has terrible Unicode support, which is needed for a lot of data science applications. The Unicode handling in Python 3 is drastically improved (UTF-8 is the default encoding).
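A small sketch of what that looks like in practice under Python 3, where `str` holds Unicode text and `bytes` holds encoded data. The sample word is arbitrary and is written with an escape so the source stays unambiguous.

```python
text = "caf\u00e9"  # the word "cafe" with an acute accent on the final letter
encoded = text.encode("utf-8")  # explicit conversion from text to bytes
decoded = encoded.decode("utf-8")  # and back again

char_count = len(text)  # counts characters
byte_count = len(encoded)  # counts bytes; the accented letter takes two
```

Keeping text and bytes as separate types means the encode and decode steps are explicit, so a mismatch shows up as an error at the boundary rather than as mangled output later.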