Tuesday, June 03, 2014

Dabbling with Julia

As I've remarked before, I've done significant coding in a large number of languages over the last 35-or-so years.  I don't consider myself a computer language savant; I've known folks who can pick up new languages quickly and switch between them facilely, but for me it is more difficult.  I haven't tried learning a new language in perhaps 5 years, but this week I backed into one
By backed in, I mean it wasn't my intent to try a language.  Usually I've made a conscious decision to explore a language by first reading about it and thinking about it (Forth & Prolog never got past this stage), and then trying the language in a leisurely manner on some unimportant side project.  But this time I ended up writing code that I somewhat cared about, which is how I ended up trying something new in the first place.

What got me there was a snag, plus some pre-thoughts from my longtime computer language maven: my oldest brother.  He steered me into both C++ and Java before they were cool, and hasn't let my use of Perl come between us.  He had remarked a few times that the language he is keeping an eye on is Julia. Now his interest is high-performance numerical computing, which is the bent of Julia.  My quick look at the bioinformatics side suggested that while there is a basic sequence manipulation library (translation & such), there isn't yet Julia libraries for reading and writing sequences, reading BLAST or similar alignment output, and the numerous other bits of everyday bioinformatics that BioPerl covers.  Now, one of the strengths of Julia is apparently an easy borrowing of libraries from C and from Python, so perhaps that is a route if you really want to dig in, but that's not appealing or realistic right now for me. So I didn't pursue it.

However, I was faced with a work problem where I needed to read some HDF5 files. HDF5 is a structured, compressed file format that is starting to gain traction for large bioinformatics datasets.  Pacific Biosciences uses it for storing much of their data, and other uses are starting to show up.  Unfortunately, I soon found that Perl's HDF5 library is incomplete in an annoying way: numeric types aren't properly supported.   So one of the elements I wished to extract from the file came not as a nice floating point number, but instead as some packed value.

So, I was faced with a problem, with multiple possible solutions. I could try to figure out how to decode the value.  I could parse the output of a generic HDF dumping tool.  I could use Python -- what PacBio relies on -- which I dabbled in a bit back at Codon (over 5 years ago!). Perhaps I could try R, which I've never been strong in but have gotten useful stuff done.  Or, perhaps I could see what Julia could do.

Somehow I was in the mood for the last option.  So I installed Julia and the HDF5 library for Julia (it very nicely found the HDF5 libraries on our system, which contrasts with some of the installation nightmares I've had with some Python stuff). Before long, I had a working program for extracting the key information from my HDF file.  Not without imposing some grunts and groans on my neighbors, due to rude surprises such as Julia using 1-based indexing for arrays (have I run into this since BASIC in the 80s??).

import HDF5: h5open, read, attrs

for (index,filename) in enumerate(ARGS)
    fid = h5open(filename)


    println(size(someData)," ",ndims(someData)," ",length(someData))

Above is my code, somewhat mangled -- I'm being a bit paranoid, but since the file I'm trying to read is under a confidentiality agreement and I don't the other part to argue I broke that by exposing its structure, I've put in some silly stuff.  But, it does give the gist of how much / little I've done -- just iterate over the command line arguments, open each one, and then read some data.  HDF5 organizes data into groups, which can have subgroups and so on.  Actual data is in a dataset.  Both groups and datasets can have attributes to store metadata.

I tried a few things that don't show up above.  For example, Julia has Perl-style variable interpolation in strings -- the line "println("x:$foo") would substitute in for $foo the value of variable foo

Will I run more Julia code?  I'm on the fence, but mostly probably none anytime soon.  If I'm going to learn Python's bioinformatics libraries, I probably should just do that in Python. Julia does have some interesting support for multiprocessing, but I don't see a need for my code doing much of that (I certainly routinely use tools that use MPI or multi-threading).  If I had lots of spare time I might go write the missing parsers and writers myself -- after all, I've done this before and I could force my idea of good object design (which BioPerl doesn't have!) on the world. But debugging such parsers is an unending and thankless task, best left to someone with more energy than I.

Still, I have found in the past that even trying another language can bring benefits.  I work in Perl because my brain is well-grooved in that way; problems start decomposing into code.  But my little foray into Scala, the last new language I tried, ended up slightly shifting my Perl coding style -- I use grep far more than I used to. The bit of code above is probably too rudimentary to have any real effects like that, but perhaps the next time I try out Julia I'll do something more complex and take away something more than a simple file parser.


Adam Retchless said...

R has one-based indexing...

Keith Robison said...

thanks for pointing that out -- I should have checked. I use R so infrequently, and often following published or my own recipes, that I only deal with that issue on the spot -- and then promptly forget it!

Jonathan Badger said...

Matlab also has 1-based indexes. It's pretty standard in any numerical context to do that, as that's how indices are normally used in equations and matrices.

As for Julia, I have been playing with it too -- it is still in its early days and many libraries haven't been written for it yet. The reason why people are excited about it is that it provides a simple syntax (sort of a mixture of Matlab and Python) while having *much* better performance -- quite near C/Fortran speed.