Tuesday, September 09, 2014

Reanalysis Lays Bare MinION Review's Spectacular Flaws

I will confess that when our first MinION burn-in data for lambda came in & I threw a few aligners at it (after first getting my data extractor in Julia shaken out), I was disappointed at the results.  Very few 2D reads, very few aligned reads and the alignments all short.  At this point, I sat back to wait to see what others had experienced and to think of additional bioinformatics approaches.  It never occurred to me to dash off a glorified  blog post and submit it to a journal.
Of course, we know that is the approach that one group in the MAP took.  I wrote my protest last Friday and Mick Watson wrote another piece shortly thereafter, and since then Samanta at Homolog.us has taken a gimlet eye to our writings.  But none of the three of us seemed to have pulled the Mikheyev & Tin data and played with it. So I decided today to remedy this obvious failure .

So, I downloaded the Mikheyev & Tin data and ran their analysis scripts, which appeared to reproduce their results.  These used BLASTN and BLASR as aligners for the data.  This already was curious, as I don't believe I've seen anyone in MAP post good results with BLASTN.  BLASR would seem a reasonable choice, but all aligners are sensitive to parameters -- and I haven't been using BLASR so wasn't sure of the current best parameters.

So, off to LASTAL, with which is what I've performed nearly all of my MinION work.  Now, as mentioned before, parameters are critical.  Since the MinION data, particularly single strand data, is rich in indels, the both the gap open and gap extension penalties should be very liberal.  My choice is 1 for all the penalties/rewards (lastal -a 1 -b 1 -q 1 -r 1 ...).

Summaries of the alignment statistics can be found in Table 2 of  Mikheyev & Tin.  For BLASR, they aligned 3.4K (12%) of the 1D reads, for 0.65Mb of bases aligned to the lambda reference..  This yields a coverage of 13.5X.  For 2D, the statistics aren't much different: 0.9K reads aligned yielding 0.81Mb of aligned bases.  With the LASTAL approach,  8.8K of the 1D reads aligned to yield 33.3Mb of aligned bases. Similarly, for the 2D data, 3.6K reads aligned yielding 6.6Mb of aligned bases.  So the improved alignment procedures increased the number of aligned 1D data by about 51X and the amount of 2D data by about 8.1X.  These numbers are conservative; I kept only the longest alignment for each read, which precludes some remaining cases in which two alignments are made which plausibly represent a central region the which failed to yield an alignment (I do apologize for some different numbers I tweeted earlier; I let my excitement get the best of me and the 2D number was quite off).

Many of what the improved alignment regime recovers are some very long reads. The longer alignments from their BLASR dataset are improved modestly, such as the best alignment of 9044 bp going to 10331.  More dramatically, my longest alignment is 29.2Kb (which actually covers 35.4Kb of the reference).  This isn't a knock on BLASR; properly parameterized it would probably perform significantly better.  

It is also worth noting that Mikheyev & Tin describe the major error mode in the ONT data as "The major sources of error in MinION data were indels, particularly insertions that introduce spurious data".  While I agree indels are the major error mode, I actually see a greater number of missed bases than added ones; the length of the aligned region on the reference is almost always longer than the length of the aligned region of the read.  This also suggests the ideal ONT aligner would not apply gap penalties symmetrically.  I haven't seen that, though there are some newer schemes than the one I used here which apply a custom substitution matrix.  Ewan Birney has a prototype tool called porewise which goes back even further, aligning the original measured events rather than the called bases, the the reference.  Clearly there are great opportunities for clever minds to develop better ONT data processing tools, but that requires applying some imagination, which is sorely lacking in the manuscript.

Another point in the Mikheyev & Tin manuscript is that the distribution of their read lengths did not fit the expected distribution of fragments from the lambda genomic library (this was presumably determined by a device such as a BioAnalyzer, though I don't see the method described).  Such a correspondence was reported in the brief presentation on MinION by Bruce Jaffe at this year's AGBT. Eyeballing the histogram of the span lengths of my alignments on the reference, it would seem the 1D alignments (top) are still a bit downshifted.  More significantly, the 2D reads (bottom) are still not behaving. As I mentioned in the earlier post, Oxford ran into some real-world problems with the distribution of the first round of kits, which severely reduced the number of 2D reads generated.  Within the MAP (and now in their public presentations), Oxford has acknowledged the problem and been working to refine both their distribution scheme and the protocol so as to increase the 2D yield.

Mikheyev & Tin performed a lot of extrapolation from the alignment statistics they obtained and calculated, such as a seemingly absurd number of flowcells to sequence even a small microbial genome.  In contrast, Nick Loman has discussed SNP calling on bacterial genomes and Mike Schatz has started talking about his work assembling Saccharomyces using corrected MinION data.  Because Mikheyev & Tin failed to explore a wide range of parameters, or to pay attention to the MAP community whilst they were rushing off their manuscript, and as a result they've generated a grotesquely flawed analysis.  It would certainly be the honorable approach for the authors to retract this embarrassment before it is actually printed.

More datasets generated using newer chips and kits should be showing up soon; Nick Loman is apparently wrangling a very large microbial set to a public location. These should give a wide audience a snapshot of the current status of the platform, though the next round of protocol improvements, kits and flowcells is already winging its way around the world.  I'm sure there will be lively debate around biases, error modes and real throughput, as well as what applications it can and cannot be used for (and particularly applications for which it is uniquely suited). If Mikheyev & Tin can contribute anything to the evaluation of this sequencing platform, it is to provide a stark object lesson in the necessity of well-vetted methods, and the wisdom of collaborating with the community rather than launching misguided solo punditry. 


homolog.us said...

Hi Keith,

I did my own analysis and had trouble getting anything meaningful. Maybe you can tell me where I am going wrong.

I took all reads from Mikheyev's set and determined a k-mer distribution (k=12). If the data set is truly reflective of the genome, highly occurring kmers from that distribution should match the genome. I do not see that happening.

I am doing the same analysis with Ecoli data now, and if that one works, then maybe the problem is with Mikheyev's data.

Will get back to you.

Keith Robison said...

The kmer analysis you describe is going to fail with high indel rates.

What does this look like for raw PacBio CLR data?

homolog.us said...

"The kmer analysis you describe is going to fail with high indel rates."

Mathematically I do not see why. Here I worked out k-mer dist in E. coli, but still lot needs to be done.


Keith Robison said...

Sorry, wrong thought - error mode didn't matter, just error rate. If you picked too long a For the error rate, then few legitimate kmers will exist and you won't find your relationship.