Tuesday, February 18, 2020

Skimming Seq

My last post discussed BioJulia in the face of a challenge from the new Seq programming language.  Tonight I'm going to take a bit more of a look at Seq itself and touch on both why I'm tempted to try it and why I remain reluctant to do so.  I hope if any of the Seq team sees this they will regard it as some parts constructive criticism and some parts market feedback.
When I joined the strain factory last year, I vowed to quit writing anything serious in Perl.  I really missed the boat by not doing so when I joined Warp Drive, and my experience there too often involved butting up against important areas where Perl libraries were incomplete or plain didn't work.  Bilobans work over 90% of the time in Python (there's a modicum of R and a scattering of other languages), so joining a strong community would help me learn, help me stay on the wagon, and give me good reason not to pick something like Julia, which I might have gone with at another startup.

Most of what I code ends up in the service of pipelines or exploring data, so there is rarely an extreme need for speed.  But recently I was in the midst of yet another look at Oxford Nanopore R10.3 data, a notion for a contig polisher came to mind, and I implemented it.  Except the initial Python implementation ran so slowly I couldn't even properly test the code, taking hours for a small microbial genome.  So I truly couldn't tell this "Shinola" method from the substance at the other side of the earthy dichotomy associated with that brand.  A first round of refactoring to shed some code inherited from the earlier data browsing got the run times down to many tens of minutes -- more useful, but still painful.

So I started entertaining a wide array of solutions to the speed problem, ranging from very sane (profiling the code or learning to compile it with Cython) to the fun but not terribly practical (re-learning Julia or learning Rust).  I also contemplated just describing the algorithm in this space and seeing if someone else would code it properly -- but that meant revealing my simplistic idea and worse implementation.  There were also flights-of-fancy looking at Python support for parallel code or for Actor models.

Profiling helped me think about my implementation and realize I was taking a big speed penalty from more code left over from the tool's data-browsing origins.  That got it under 10 minutes, and Cython got it down by another factor of 1.7 -- I suspect it resists further acceleration because it spends a lot of time on I/O reading the BAM file.  That seems to prove it inferior to existing methods, so I'll go check that I didn't botch the streamlined implementation, but my need for speed has abated again.
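For anyone facing a similar slowdown, the standard-library profiler is the quickest first look.  A minimal sketch, with a hypothetical per-base hot spot standing in for my polisher's inner loop (the function names here are stand-ins, not my actual code):

```python
import cProfile
import io
import pstats

def count_gc(seq):
    # Hypothetical hot spot: a per-base Python loop over each read.
    return sum(1 for base in seq if base in "GC")

def polish_stub(seqs):
    # Stand-in for the polisher's main loop.
    return sum(count_gc(s) for s in seqs)

seqs = ["ACGT" * 250] * 2000   # 2,000 reads of 1,000 bases each

profiler = cProfile.Profile()
profiler.enable()
total = polish_stub(seqs)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(total)                             # 1000000 G+C bases counted
print("count_gc" in report.getvalue())   # True: the hot spot shows up in the report
```

cProfile's per-call overhead can distort very tight loops, but it is usually enough to point at where Cython (or a rethink) will actually pay off.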

But one of the routes I did seriously consider was re-coding in Seq.  Seq is, after all, intended to be very friendly to Python programmers in terms of syntax.  My program was only about 60 lines of code, so porting it over shouldn't be too hard, should it?  I didn't get farther than planning, but here are my impressions on reading the online documentation.

One of the first bits in the Seq Tutorial is a section titled "Limitations and key differences with Python".  The big one that would be an issue for me is that collections "cannot contain objects of different types".  Perhaps it's a bit of a hangover from Perl, but I frequently use lists and dictionaries as record types, and Shinola had examples of this.  I could probably re-code these as objects, but that's one more step.  The other key differences didn't really affect me.
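To illustrate the restriction: the dict-as-record pattern below is ordinary Python but, as I read the tutorial, wouldn't fly in a Seq collection, while the class-based version keeps each collection homogeneous.  The field names are hypothetical:

```python
from dataclasses import dataclass

# Dict-as-record: values of mixed type in one dictionary -- fine in
# Python, but a heterogeneous collection by Seq's rules.
read = {"name": "read_001", "length": 1042, "pass_filter": True}

# Restructured as a small class, so a list of these is homogeneous.
@dataclass
class ReadRecord:
    name: str
    length: int
    pass_filter: bool

reads = [ReadRecord("read_001", 1042, True),
         ReadRecord("read_002", 987, False)]
print(reads[0].length)                                 # 1042
print(all(isinstance(r, ReadRecord) for r in reads))   # True
```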

But then there's a list of items that the Seq authors would like to support but don't at this time, and that list has some pain points for me: empty collection literals ([] and {}), lambda functions, and an unspecified set of Python standard modules and built-ins.  Not documenting what is and isn't available from the standard library is a serious shortcoming -- how much of sys and os is there?
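The empty-literal and lambda gaps at least have mechanical workarounds: explicitly typed constructors and named functions.  The sketch below is plain Python (3.9+); whether Seq accepts these exact spellings is my assumption from reading the tutorial, not something I've tested:

```python
# Bare empty literals -- plain Python, but listed as unsupported in Seq:
hits = []
counts = {}

# Explicitly typed constructors: valid Python 3.9+, and presumably
# closer to what a statically typed language like Seq needs.
hits_typed = list[int]()
counts_typed = dict[str, int]()

# A named function instead of a lambda (also on Seq's unsupported list):
def by_length(s: str) -> int:
    return len(s)

print(sorted(["AAA", "C", "GG"], key=by_length))   # ['C', 'GG', 'AAA']
print(hits_typed == hits, counts_typed == counts)  # True True
```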

There is a section on calling Python from Seq.  It explicitly mentions NumPy as callable; it would be useful to know which other common libraries work, and whether issues arise if a library returns collections with a mixture of types.  When I am writing serious code, I may want to read, write, or process CSV files with Pandas, and I may need to access relational databases and many other things.

For Shinola, the key requirement is that I be able to look at alignments, in BAM format, of the nanopore reads against the draft genome sequence or sequences.  The Cookbook section of the manual has an extremely terse section on "Reading SAM/BAM/CRAM" that shows how to iterate through each of the file types.  The objects returned by the iterator appear to be reads, with fields essentially mirroring the HTSlib read fields.  But there's no high-level support for alignments -- the only way to work with an alignment appears (from looking at the Seq code itself) to be to write your own code to parse the CIGAR string, and I don't see any built-in support for pileups at all.  That's going to be a real hindrance for many; I've written CIGAR parsers, and it usually takes several days of debugging before one truly handles all cases correctly.  In my opinion that's a core capability for large-scale sequence processing these days, and its absence is a major gap in Seq.
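For the curious, the basic bookkeeping such a parser must do is walk the CIGAR operations and track which ones consume query versus reference bases (per the SAM specification); the hard-won part is the long tail of edge cases around clipping and supplementary alignments.  A minimal sketch of the core:

```python
import re

# CIGAR operations and which bases they consume, per the SAM spec;
# '=' and 'X' are explicit sequence match/mismatch.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def cigar_spans(cigar):
    """Return (query_bases, reference_bases) consumed by a CIGAR string."""
    query = ref = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in CONSUMES_QUERY:
            query += n
        if op in CONSUMES_REF:
            ref += n
    return query, ref

# 10 aligned bases, a 2-base insertion in the read, a 3-base deletion,
# and 5 soft-clipped bases at the end:
print(cigar_spans("10M2I3D5S"))  # (17, 13)
```

This handles the happy path; projecting individual read positions onto reference coordinates, or building pileups on top of it, is where the multi-day debugging the text mentions comes in.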

I'll probably keep watching Seq from time to time.  A pang for fast code doesn't strike often for me, but once in a while I'm in need.  Seq should have much less of a learning curve than Julia or Scala or especially Rust.  So maybe if profiling and Cython and rethinking an implementation aren't sufficient, I'll turn to Seq in the future to goose performance.
