Thursday, February 13, 2020

A Short Master Class in Benchmarking

There's a blog post on BioJulia.net that is well worth reading, even if you don't use Julia or I'd argue if you don't actually program.  It looks at an issue of performance that was raised with BioJulia and with fierce but respectful passion examines the critique and explores just why BioJulia didn't perform well in the comparison.  In the end, this triggers a code review and a huge speed increase in the problematic areas -- which will widely benefit BioJulia users.
The setting for this is the new Seq language, a domain-specific language for biological sequence processing intended to be easy for Python programmers to learn.  I have some other thoughts on Seq I'll save for another post.  Seq was apparently announced at a computational conference last fall, and as part of the presentation showed benchmarks where it appeared to clobber Julia code for speed.  That is likely to sting for any Julia partisan, as the language has been designed from the beginning to be fast.

One way I'd characterize the blog post is that they repeatedly bring up points which some might have used as excuses to stop going forward: benchmarking is inherently problematic, benchmark programs are rarely representative of the real world, etc -- but always they plunge forward instead of taking an easy out.

First the BioJulia group obtained the benchmark code -- the Seq authors had made public a virtual machine with their Seq environment, but not the Julia code -- but provided it on request.  On the speed, it was easy to replicate.  However, the BioJulia group found a number of wrong results from Seq  One was a truly horrifying mortal sin: reverse complementing an odd length sequence failed to reverse the central nucleotide.  Another is perhaps a venial sin: ambiguous nucleotides when complemented are simply forced to N, rather to the correct mapping (e.g. S to W and vice versa).  

A first examination was to see if the benchmark Julia code is written in idomatic Julia; the just-in-time compiler in Julia can compile different logically equivalent blocks of code to very different final code with very different running times.  The code was found to be mostly idiomatic and the few changes made little difference in the run time

Profiling the Julia code revealed that most of the time was spent reading the sequences and converting them to the either 2-bit or 4-bit packed formats that Julia uses to represent DNA sequences.  In contrast, Seq apparently stores bases just as byte representations of the ASCII character.  There's all sorts of clever bit twiddling one can do with the correct representation, such as reversing a base being a simple not operation with a 2-base encoding.  Plus it can save a lot of space. Also, Seq performs no sanity checking on a DNA sequence whereas Julia insists that only legal IUPAC characters are present.  Safety is a good thing, but doesn't come free!

Again, some authors would have bailed here and just argued the benchmarks aren't appropriate and that for many applications the cost of checking and packing the sequences will be paid back many-fold.  Of course, for some applications it would be a complete waste -- imagine for example a simple quality-based read trimmer that never actually looks at the sequence content.

But again, the BioJulia folks didn't stop there.  First they reimplemented the necessary code using byte arrays and were able to equal or even substantially beat Seq in the benchmarks.  Those didn't even include another trick up Julia's sleeves in terms of buffering IO; that yielded extra speed but wasn't used in the benchmarks.

Satisfied now that they could beat Seq but unsatisfied with the actual BioJulia library code performance, they reviewed it, with led to "optimising a dozen code sections".  The new code beats Seq on only one of three benchmarks and on one is a bit slower, but its still a huge advantage over the original BioJulia code. 

The BioJulia authors finish with some thoughts about Seq.  As you might expect, they aren't fans of the concept, arguing that using a Domain Specific Language risks losing access to the multitude of other libraries that are often used in conjunction with your sequence analysis.  In the Python world, for example, frequently a sequence tool will use MatPlotLib to generate plots or Pandas to create and analyze dataframes.  Seq claims to be able to use Python libraries in general, but I haven't seen any sort of testing of that.  Again, a separate post on Seq soon.

In the comments there's some further discussion with one of the Seq authors, who left a detailed and thoughtful comment.  In particular, some of the bugs in Seq noted by the BioJulia authors have been fixed. There's also some debate there between the two sides as to what degree speed optimizations offered by Seq (pipelining, prefetching) and future plans (compiling to FPGA or GPU)  that might be in Seq can be implemented cleanly on the BioJulia side.

Another comment gives some short Julia code to reverse complement ASCII codes for bases and points out that packing multiple bases per byte imposes a penalty for some operations --  but the BioJulia folks find that the difference is single digit milliseconds on a 100kb sequence. Plus there's some down-to-the-metal issues which impose memory and possible less optimizability to the posted code.

So a very nicely written post showing the thought process for responding constructively to so very public criticism and how to analyze and optimize a process.  

No comments: