Wednesday, January 27, 2010

The Scala Experiment

Well, I've taken the plunge -- yet another programming language.

I've written before about this. It's also a common question on various professional bioinformatics discussion boards: which programming language to use?

It is a decent time to ponder some sort of shift. I've written a bit of code, but not a lot -- partly because I've been more disciplined about using libraries as much as possible (versus rolling my own) but mostly because coding is a small -- but critical -- slice of my regular workflow.

At Codon I had become quite enamored with C#. Especially with the Visual Studio Integrated Development Environment (IDE), I found it very productive and a good fit for my brain & tastes. But, as a bioinformatics language it hasn't found much favor. That means no good libraries out there, so I must build everything myself. I've knocked out basic bioinformatics libraries a number of times (read FASTA, reverse complement a sequence, translate to protein, etc), but I don't enjoy it -- and there are plenty of silly mistakes that can be easy to make but subtle enough to resist detection for an extended period. Plus, there are other things I really don't feel like writing -- like my own SAM/BAM parser. I did have one workaround for this at Codon -- I could tap into Python libraries via a package called Python.NET, but it imposed a severe performance penalty & I would have to write small (but annoying) Python glue code. The final straw is that I'm finding it essential to have a Linux (Ubuntu) installation for serious second-generation sequencing analysis (most packages do not compile cleanly -- if at all -- in my hands on a Windows box using MinGW or Cygwin).

The obvious fallback is Perl -- which is exactly how I've fallen so far. I'm very fluent with it & the appropriate libraries are out there. I've just gotten less and less fond of the language & its many design kludges (I haven't quite gotten to my brother's opinion: Perl is just plain bad taste). I lose a lot of time with stupid errors that could have been caught at compile time with more static typing. It doesn't help that I have (until recently) been using the Perl mode in Emacs as my IDE -- once you've used a really polished tool like Visual Studio you realize how primitive that is.

Other options? There's R, which I must use for certain projects (microarrays) due to the phenomenal set of libraries out there. But R just has never been an easy fit for me -- somehow I just don't grok it. I did write a little serious Python (i.e. not just glue code) at Codon & I could see myself getting into it if I had peers also working in it -- but I don't. Infinity, like many company bioinformatics groups, is pretty much a C# shop though with ecumenical attitudes towards any other language. I've also realized I need a basic comprehension of Ruby, as I'm starting to encounter useful code in that. But, as with Python, I can't seem to quite push myself to switch over -- it doesn't appeal to me enough to kick the Perl habit.

While playing around with various second-generation sequencing analysis tools, I stumbled across a bit of weird code in the Broad's Genome Analysis ToolKit (GATK) -- a directory labeled "scala". Turns out, that's yet another language -- and one that has me intrigued enough to try it out.

My first bit of useful code (derived from a Hello World program that I customized by having it output in canine) is below and gives away some of the intriguing features. This program goes through a set of ABI trace files that fit a specific naming convention and writes out FASTA of their sequences to STDOUT:

package hello
import java.io.File
import org.biojava.bio.Annotation
import org.biojava.bio.seq.Sequence
import org.biojava.bio.seq.impl.SimpleSequence
import org.biojava.bio.symbol.SymbolList
import org.biojava.bio.program.abi.ABITrace
import org.biojava.bio.seq.io.SeqIOTools
object HelloWorld extends Application {

  for (i <- 1 to 32)
  {
    val lz = new java.text.DecimalFormat("00")
    var primerSuffix="M13F(-21)"
    val fnPrefix="C:/somedir/readprefix-"
    if (i>16) primerSuffix="M13R"
    val fn=fnPrefix+lz.format(i)+"-"+primerSuffix+".ab1"
    val traceFile=new File(fn)
    val name = traceFile.getName()
    val trace = new ABITrace(traceFile)
    val symbols = trace.getSequence()
    val seq=new SimpleSequence(symbols,name,name,Annotation.EMPTY_ANNOTATION)
    SeqIOTools.writeFasta(System.out, seq);
  }
}

A reader might ask "Wait a minute? What's all this java.this and biojava.that in there?". This is one of the appeals of Scala -- it compiles to Java Virtual Machine bytecode and can pretty much freely use Java libraries. Now, I mentioned this to a colleague and he pointed out there is Jython (Python to JVM compiler), which reminded me of a reference to JRuby (Ruby to JVM compiler). So, perhaps I should revisit my skipping over those two languages. But in any case, in theory Scala can cleanly drive any Java library.

The example also illustrates something that I find a tad confusing. The book keeps stressing how Scala is statically typed -- but I didn't type any of my variables above! However, I could have -- so I can get the type safety I find very useful when I want it (or hold myself to it -- it will take some discipline) but can also ignore it in many cases.
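
For illustration, here is a minimal sketch of the same declaration written with and without a type annotation. The object and value names are hypothetical, not from the program above; the point is only that the compiler assigns a static type in both cases.

```scala
object TypeInferenceSketch {
  // No annotation: the compiler infers String from the literal.
  val inferredName = "readprefix-01-M13F(-21).ab1"

  // Same value with an explicit annotation; the static type is identical.
  val annotatedName: String = "readprefix-01-M13F(-21).ab1"

  // Inferred as Int from the return type of String.length.
  val inferredLength = inferredName.length

  // Uncommenting the next line is rejected at compile time (type mismatch),
  // even though inferredLength was declared without an annotation:
  // val wrong: String = inferredLength

  def main(args: Array[String]): Unit = {
    println(inferredName == annotatedName)
    println(inferredLength)
  }
}
```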

Scala has a lot in it, most of which I've only read about in the O'Reilly book & haven't tried. It borrows from both the Object Oriented Programming (OOP) lore and Functional Programming (FP). OOP is pretty much old hat, as most modern languages are OO and if not (e.g. Perl) the language supports it. Some FP constructs will be very familiar to Perl programmers -- I've written a few million anonymous functions to customize sorting. Others, perhaps not so much. And, like most modern languages all sorts of things not strictly in the language are supplied by libraries -- such as a concurrency model (Actors) that shouldn't be as much of a swamp as trying to work with threads (at least when I tried to do it way back yonder under Java). Scala also has some syntactic flexibility that is both intriguing and scary -- the opportunities for obfuscating code would seem endless. Plus, you can embed XML right in your file. Clearly I'm still at the "look at all these neat gadgets" phase of learning the language.
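
As a rough sketch of the sorting case, here is how an anonymous function can customize a sort in Scala, using the standard collection methods sortWith and sortBy. The read names and lengths are made-up data for illustration.

```scala
object SortSketch {
  // Hypothetical read names paired with sequence lengths.
  val reads = List(("read-03", 412), ("read-01", 655), ("read-02", 98))

  // Anonymous function used as a comparator, in the spirit of
  // Perl's sort { $b <=> $a }: longest read first.
  val byLengthDesc = reads.sortWith((a, b) => a._2 > b._2)

  // sortBy takes a key-extracting function instead of a comparator.
  val byName = reads.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    byLengthDesc.foreach(println)
    byName.foreach(println)
  }
}
```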

Is it a picnic? No, clearly not. My second attempt at a useful Scala program is a bit stalled -- I haven't figured out quite how to rewrite a Java example from the Picard (Java implementation of SAMTools) library into Scala -- my tries so far have raised errors. That's partly because the particular Java idiom being used was unfamiliar -- if I thought Scala was a way to avoid learning modern Java, I've quite deluded myself. And, I did note that tonight when I had something critical to get done on my commute I reached for Perl. There are still a lot of idioms I need to relearn -- constructing & using regular expressions, parsing delimited text files, etc. Plus, it doesn't help that I'm learning a whole new development environment (Eclipse) virtually simultaneously -- though there is Eclipse support for all of the languages it looks like I might be using (Java, Scala, Perl, Python, Ruby), so that's a good general tool to have under my belt.
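
For the two idioms named above, a minimal sketch of what they might look like in Scala follows. The regular expression assumes the trace file naming convention from the earlier program (prefix, two-digit number, primer suffix); the object and function names are hypothetical.

```scala
object IdiomSketch {
  // Assumed naming convention: <prefix>-<two digits>-<primer>.ab1
  val TraceName = """(.+)-(\d{2})-(M13F\(-21\)|M13R)\.ab1""".r

  // A Regex can be used directly as a pattern; it must match the whole string.
  // Returns (number, primer) on a match, None otherwise.
  def parseTraceName(name: String): Option[(Int, String)] = name match {
    case TraceName(_, number, primer) => Some((number.toInt, primer))
    case _                            => None
  }

  // Split a tab-delimited line; the -1 limit keeps trailing empty fields.
  def splitTabbed(line: String): Array[String] = line.split("\t", -1)

  def main(args: Array[String]): Unit = {
    println(parseTraceName("readprefix-07-M13R.ab1"))
    println(splitTabbed("chr1\t100\t200").mkString("|"))
  }
}
```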

If I do really take this on, then the last decision is how much of my code to convert to Scala. I haven't written a lot of code -- but I haven't written none either. Some just won't be relevant anymore (one offs or cases where I backslid and wrote code that is redundant with free libraries) but some may matter. It probably won't be hard to just do a simple transformation into Scala -- but I'll probably want to go whole-hog and show off (to myself) my comprehension of some of the novel (to me) aspects of the language. That would really up the ante.

2 comments:

James Iry said...

"but I didn't type any of my variables above! However, I could have -- so I can get the type safety I find very useful when I want it"

Scala is statically typed whether you put a type annotation or not.

val x = "hello"
val y = x * x // compile time error, not a runtime error because the compiler has inferred that x has type String

Keith Robison said...

Thanks -- clearly I need to disambiguate "static typing" from "explicit type declaration" in my head!