Omics! Omics!: Asymptotically Approaching A Grok of Scala

Saturday, May 01, 2010

Asymptotically Approaching A Grok of Scala

When learning a new language, it is tempting to fall back on the patterns of a previous language. This isn't always a bad thing, but it is worth being aware of. For example, when I did a little bit of Python at Codon I realized that, compared to someone else who had just learned Python, I tended to use dictionaries in my code quite frequently. That's a pattern coming from Perl. This was also reflected in my C# code, except there (to my glee!) I could use typesafe dictionaries. My code at Codon, in comparison with that of some other programmers, tended to be very Dictionary-rich (and the dictionaries were always typesafe!). That's not saying my style was better, just distinctive and influenced by prior experience.

Now in some language transitions, there's very little of this -- because the new language is too different. SQL is an obvious example -- it's just not a procedural language and so I can't easily identify any of my SQL programming patterns which are influenced by prior languages.

But, if a programming language not only supports but encourages a different style of programming, it is useful to recognize this bias and try to go outside it, and when you have a breakthrough it is wonderful. For me, to intuitively understand a subject is to "grok" it; Heinlein's invention is too rarely used.

I had that moment tonight with Scala. The assignment was to read genotype data out of a bunch of Affymetrix 6.0 CHP files from a vendor. Now, Affy makes available an SDK for this -- but it is a frustrating one. The C++ example code is all but a printf statement away from converting CHP to tab-delimited.

But I decided to make this a Scala moment. There's a Java SDK, but it is very spartanly documented -- there's really no documentation beyond what individual methods and classes do -- no attempt to help you grok the overall scheme of things.

Worse, the class design is inconsistent. One case: the example Java code parses an expression file, and one key piece of information to get out is the number of probesets in the file, which is available via the getHeader() method. Unfortunately, it turns out getHeader is defined in the specific class and not the base class, so code working on genotyping information needs to use a different approach. Personally, I'm already annoyed because I'd rather have an enumerator to step over the probesets than get a count and ask for each one in turn -- but that is a point of style.
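To illustrate the style point, here is a minimal sketch of how a count-plus-index interface can be wrapped so that it behaves like an enumerator. The IndexedSource trait and its count and entry members are hypothetical stand-ins invented for this example; they are not the Affymetrix SDK's actual API.

```scala
// Hypothetical stand-in for a reader that exposes only a count and an indexed getter.
// These names are assumptions for illustration, not the real SDK classes.
trait IndexedSource[A] {
  def count: Int
  def entry(index: Int): A
}

object ProbesetIteration {
  // Wrap the count-plus-index interface in an Iterator so callers can use
  // for, map, foreach and friends instead of managing the index themselves.
  def iterate[A](source: IndexedSource[A]): Iterator[A] =
    Iterator.range(0, source.count).map(i => source.entry(i))

  // Toy source backed by a Vector, used only to exercise iterate.
  def fromVector[A](items: Vector[A]): IndexedSource[A] = new IndexedSource[A] {
    def count: Int = items.length
    def entry(index: Int): A = items(index)
  }
}
```

With a wrapper like this, the calling code never sees the index, which is roughly what an enumerator on the SDK side would have provided.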

Okay, problem solved. The main part of the code reads the data into a big HashMap (the dictionary-type generic class in Scala) -- that pattern again! Now I want to write the data out -- listing each genotype in a separate column with the 0th column containing the probeset name. So, I need to create a row of output values and then write it as a line to my file.
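The two-level lookup used in the snippets that follow, genotypes(probesetName)(sample), suggests a map of maps. A minimal sketch of such a table, assuming scala.collection.mutable.HashMap and assuming (for simplicity) that calls are stored as plain strings rather than whatever call type the SDK returns:

```scala
import scala.collection.mutable

object GenotypeTable {
  // probeset name -> (sample name -> genotype call).
  // Storing calls as String is an assumption made for this sketch.
  type Table = mutable.HashMap[String, mutable.HashMap[String, String]]

  def empty: Table = mutable.HashMap.empty[String, mutable.HashMap[String, String]]

  // Record one call, creating the inner per-probeset map on first use.
  def record(table: Table, probeset: String, sample: String, call: String): Unit = {
    val row = table.getOrElseUpdate(probeset, mutable.HashMap.empty[String, String])
    row(sample) = call
  }
}
```

Both levels are fully typed, so a lookup with the wrong key type is rejected at compile time.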

Version 1 is the straight old style, the one I used in Perl/C# and pretty much everything before them -- I initialize a Queue to hold the values I want to write on one line. Here out is a Java BufferedWriter which is writing to a file. The one significant Scala-ism is the code to write the line -- the reduceLeft call in the final statement is the equivalent here of a Perl join command to create the tab-delimited line.


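import scala.collection.mutable.Queue

// Collect the cells of one output row: probeset name first, then one call per sample
val q = new Queue[String]()
q.enqueue(probesetName)
for (sample <- sampleNames)
   q.enqueue(ProbeSetMultiDataGenotypeData.genotypeCallToString(genotypes(probesetName)(sample)))
// reduceLeft joins the cells with tabs, much like Perl's join
out.write(String.format("%s\n", q.reduceLeft(_ + "\t" + _)))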

Now, on looking at this I had working code, which should be time to stop. But could I take it to a more Scala-ish form? That's a challenge, which I'm happy to find I succeeded at.


out.write(String.format("%s\t%s\n",probesetName,
   (for (sample<-sampleNames)
       yield ProbeSetMultiDataGenotypeData.genotypeCallToString(genotypes(probesetName)(sample)))
    .reduceLeft(_ + "\t" + _)))

This version eliminates the queue -- a for/yield expression (which the compiler turns into a map call with an anonymous function) simply generates a sequence which the reduceLeft trick consolidates. I had cheated before and loaded the probeset name onto the queue, so here I need to tweak the String.format call to get that in.
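For anyone who wants to run the two versions side by side, here is a self-contained sketch. The callToString function and its integer codes are hypothetical stand-ins for ProbeSetMultiDataGenotypeData.genotypeCallToString, and a flat Map of per-sample calls replaces the nested genotypes table; neither reflects the real SDK.

```scala
import scala.collection.mutable.Queue

object RowBuilding {
  // Hypothetical stand-in for the SDK's call-to-string conversion;
  // the integer codes are arbitrary and chosen only for this example.
  def callToString(call: Int): String = call match {
    case 0 => "AA"
    case 1 => "AB"
    case 2 => "BB"
    case _ => "NoCall"
  }

  // Version 1: fill a queue imperatively, then join with reduceLeft.
  def withQueue(probeset: String, samples: Seq[String], calls: Map[String, Int]): String = {
    val q = new Queue[String]()
    q.enqueue(probeset)
    for (sample <- samples) q.enqueue(callToString(calls(sample)))
    q.reduceLeft(_ + "\t" + _)
  }

  // Version 2: for/yield produces the cells directly; no queue needed.
  def withYield(probeset: String, samples: Seq[String], calls: Map[String, Int]): String = {
    val cells = for (sample <- samples) yield callToString(calls(sample))
    probeset + "\t" + cells.reduceLeft(_ + "\t" + _)
  }
}
```

Both functions return the same tab-delimited row for the same input as long as there is at least one sample; the only difference is whether the intermediate collection is built by hand.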

Now, the question is -- is this better? One metric might be readability, and I'm not sure which I find more readable. The first is a style I'm used to reading and I tend to recognize the pattern -- or do I? If I revisit that code 6 months from now will I say "What is this queue for?". The second one is terser -- but is it a good terser? Perhaps if I start using that pattern repeatedly it will become second nature to read.

Another metric would be performance -- which is tedious to measure, but my guess is that since I am following the form suggested by the language, the compiler is likely to optimize this version better.

Ah, but after writing this entry I saw I could do better -- definitely cleaner. Instead of the explicit loop in the code I'll use the map function, which takes a series of values and applies a transformation to each. So I still have a long way to go before I can claim to grok Scala! I could blame this on being diverted away from Scala for a month plus (I'd actually created some code like the below before, now that I think about it).


out.write(String.format("%s\t%s\n",probesetName, 
    sampleNames.map(sample=>         
      ProbeSetMultiDataGenotypeData.genotypeCallToString(genotypes(probesetName)(sample)))
 .reduceLeft(_ + "\t" + _)))

It is worth noting that this final style is actually largely available in Perl, which has a map function and some other stuff to support this. I never really tried to work that way and personally I foresee all sorts of bugaboos from a lack of type safety. But I could have worked this way in the past.
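Returning to the Scala side for a moment: the reduceLeft trick used as a join has a built-in counterpart, mkString, in the standard collections. The sketch below compares the two; the RowJoin object and the callFor parameter are names made up for the example.

```scala
object RowJoin {
  // The reduceLeft-based join from the snippets above.
  // Note: reduceLeft throws on an empty collection.
  def joinWithReduce(cells: Seq[String]): String =
    cells.reduceLeft(_ + "\t" + _)

  // mkString does the same job and returns "" for an empty collection.
  def joinWithMkString(cells: Seq[String]): String =
    cells.mkString("\t")

  // Whole row in one expression: prepend the probeset name, map samples to calls, join.
  // callFor is a hypothetical lookup from sample name to genotype string.
  def row(probeset: String, samples: Seq[String], callFor: String => String): String =
    (probeset +: samples.map(callFor)).mkString("\t")
}
```

Prepending the probeset name to the mapped cells also removes the need for the extra "%s\t" in the format string, though whether that reads better is again a matter of taste.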

One final note: I'm getting to like the way Scala can do a lot of compile-time type checking without my needing to clutter the code with lots of type annotations. C# is particularly bad about most type annotations being written twice, but even after cleaning that up Scala goes one further and infers many types. "sample" in these examples is strictly a String, but I don't have to declare that -- and so the code is stripped to nearly the bare essentials but I get a bit of proofreading.
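A small sketch of that inference at work, using made-up sample names rather than anything from the CHP files:

```scala
object InferenceDemo {
  // No annotation needed: the compiler infers List[String].
  val sampleNames = List("sampleA", "sampleB")

  // `sample` is inferred to be a String, so calling .length type-checks
  // and the result is inferred as List[Int].
  val lengths = sampleNames.map(sample => sample.length)

  // The explicit annotation here only compiles because the inferred type matches.
  val checked: List[Int] = lengths

  // Uncommenting the next line would be a compile-time error, not a runtime surprise:
  // val wrong: Int = sampleNames.head
}
```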