Friday, December 11, 2015

MinION and Time-to-Result

Peripatetic blogger Dale Yuzuki posed a question on my last piece which I'll answer with a separate post because it crystallizes for me what makes the Oxford Nanopore platform so different for a large number of counting-type assays.  Dale's question was on Zev Williams' talk on pre-implantation screening and the number of reads required.


Zev did indeed go into some depth on the subject of the number of reads required for his application, which is identifying aneuploidies in embryos.  As an aside, I had been unaware that the rate of full-term babies can be nearly completely decoupled from maternal age via screening out aneuploid embryos.  Note that the goal is to simply score chromosome counts; resolving structural variants was not in the scope of this particular project, largely because it is murky which SVs are clinically relevant. 

So, by statistics, simulation, or trial-and-error, one can work out the number of reads required to score aneuploidies.  To have sensitivity for all chromosomes, it is driven by the hardest to find, which is a function of size (tiny chromosome 21) and mappability (the Y is both small and hard to map reads to).  Chromosome 21 ends up being the most challenging, and Williams worked through an estimate that a few hundred reads mapping to 21 are sufficient to enable counting this chromosome.  Of course, gross SVs on the other chromosomes at a similar scale will be detectable. On the other hand, balanced translocations would also appear to be out of scope for this particular assay concept.

Okay, once you know the number of reads required (Williams placed it at 15K well-aligned reads), that needs to be grossed up to account for reads that can't be unambiguously mapped (unless you've already done so). Zev's choice of short (~500bp) reads applies here; this is sufficient for mapping in most cases, and in the cases it isn't it would mostly take much longer reads.  A similar logic has long existed in the sequencing-by-synthesis world; for this sort of counting application the required read length is shockingly short.  Now again, the goal isn't to find or fine-map small SVs; for that one would want long reads. 
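
The arithmetic behind those read counts can be sketched in a few lines of Python. This is only a rough illustration under a simple Poisson counting model: the 15K target is from the talk, but the chromosome 21 share of the genome (about 1.5%) and the 75% mappable fraction are values assumed here for illustration, not figures Williams presented.

```python
import math

TARGET_MAPPED = 15_000     # well-aligned reads, from the talk
CHR21_FRACTION = 0.015     # assumption: chr21 is roughly 1.5% of the genome
MAPPABLE_FRACTION = 0.75   # assumption: hypothetical share of reads that map unambiguously

def expected_chr_reads(total_mapped_reads, chr_fraction):
    """Expected reads on one chromosome if reads land uniformly across the genome."""
    return total_mapped_reads * chr_fraction

def gross_up(target_mapped_reads, mappable_fraction):
    """Raw reads to sequence so that the mappable subset reaches the target."""
    return math.ceil(target_mapped_reads / mappable_fraction)

chr21_reads = expected_chr_reads(TARGET_MAPPED, CHR21_FRACTION)

# Under Poisson counting, the standard deviation is sqrt(mean).
# A full trisomy carries three copies instead of two, i.e. 1.5x the reads.
z_trisomy = (1.5 * chr21_reads - chr21_reads) / math.sqrt(chr21_reads)

raw_reads = gross_up(TARGET_MAPPED, MAPPABLE_FRACTION)

print(f"expected chr21 reads: {chr21_reads:.0f}")
print(f"trisomy signal in standard deviations: {z_trisomy:.1f}")
print(f"raw reads to sequence: {raw_reads}")
```

With these assumed inputs the 15K target puts a couple of hundred reads on chromosome 21, which is in line with the "few hundred reads" estimate above.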

At this point, let's look at an early divergence from thinking for other platforms.  Short reads on most platforms mean short running times, which means in turn more turns of the expensive sequencing instrument.  Incyte realized this early on in their EST effort during the 1990s: their customers cared about the number of libraries screened and number of reads per library, and mRNAs could be identified from about 250-300 basepair fragments (if memory serves).  Short runs meant more reads per day, which was where the value lay.  

But MinION is an inexpensive setup, even if one monitor and one PC are dedicated to each.  So turning instruments over isn't a consideration.  PromethION is a big box, but Clive Brown assured me that each of the 48 pieces (which I think have two independent flowcells each) is hot-swappable and independent, so PromethION can be treated as a decluttering set of 48 MinION pairs.

Now, calculations for an aneuploidy assay will have to take into account various flavors of junk reads, regardless of the platform. These tend to be predictable for each platform, so the necessary read count is grossed up again.  So for any given platform (really platform version, as Oxford keeps tuning theirs up), there is a target number of reads.

The critical mental shift with Oxford is to start thinking of performance in terms of data per unit time.  On every other platform, output has a time component, in that, for any fragments not completely read, further cycles (or more time, for PacBio) will yield further information on those fragments.  But for MinION, once a fragment is exhausted then that nanopore is available to grab another fragment and process it.  Processing is remarkably fast; with the current devices Williams' 500bp fragments should take less than 30 seconds to fully traverse the pore in 2D mode.

Since performance is measured in terms of reads per unit time, and we have a constant target of reads, performance improvement should translate directly into reductions in time.  So if Oxford enhances the performance, by some combination of faster translocation through the pores and more sensors, by a factor of 10, then run times for this experiment may drop by a similar factor (although faster traversal won't help with the idle time for a pore between fragments).
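
To make that scaling concrete, here is a minimal Python sketch of time-to-target for a pool of pores. The 30 seconds per read is the upper bound quoted above; the pore count, the idle time between fragments and the raw read target are hypothetical values chosen only to give round numbers.

```python
def minutes_to_target(target_reads, active_pores, seconds_per_read, idle_seconds):
    """Wall-clock minutes for a pool of pores to deliver target_reads.

    Each pore repeats a cycle: read one fragment (seconds_per_read),
    then sit idle (idle_seconds) until it captures the next fragment.
    """
    reads_per_second = active_pores / (seconds_per_read + idle_seconds)
    return target_reads / reads_per_second / 60.0

TARGET = 18_000   # hypothetical raw read target
PORES = 300       # hypothetical number of active pores
IDLE = 30         # hypothetical idle seconds between fragments

baseline = minutes_to_target(TARGET, PORES, 30, IDLE)
faster_pores = minutes_to_target(TARGET, PORES, 3, IDLE)       # 10x faster translocation
more_pores = minutes_to_target(TARGET, PORES * 10, 30, IDLE)   # 10x more sensors

print(f"baseline: {baseline:.0f} min")
print(f"10x faster translocation: {faster_pores:.0f} min")
print(f"10x more pores: {more_pores:.0f} min")
```

In this toy model, ten times more pores cuts the run time tenfold, while ten times faster translocation only roughly halves it, because the idle time between fragments is untouched.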

Williams had his assay down to 1 hour of sequencing, and the current library prep takes about 1.5 hours (his group has modified the prep for short fragments, but this involves tuning and not wholesale protocol changes).  So that's 2.5 hours from purified DNA to a complete dataset ready to map (ignoring the possibility that mapping could be running in parallel to the sequencing).  With fast mode or denser chips, the sequencing time might be reduced to 10 minutes or so!

Conversely, the promised-to-be-imminent transposase 1D prep reduces library prep to only 20 minutes (versus 90 for the current prep).  Now, it is generating only 1D reads, which will have a higher mapping failure rate -- but with that reduction in prep time, a mapping rate more than 50% lower could be tolerated.  Actually, it is better than that: because only one strand must traverse the pore, a 1D fragment ties up the pore for a bit less than half the time of the same-sized 2D fragment.  So either a higher mapping failure rate can be tolerated, or the fragment size increased to get more mappings.  In any case, a 20 minute 1D prep that is sufficient is superior to the 2D prep even if total time (sample to finished run) is the same, since one ties up a skilled lab worker for 20 minutes and the other for 90!  
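
The trade-off can be worked through with a small Python sketch. It holds the total sample-to-result time at the 2D figure (90 minutes of prep plus 60 of sequencing) and asks how far the 1D mapping rate could fall. The factor of 2 for 1D read rate is an assumption that a single strand passes in half the time and that idle time between fragments is negligible.

```python
def tolerable_mapping_ratio(total_minutes, prep_minutes, baseline_seq_minutes, read_rate_ratio):
    """Lowest mapping rate, relative to the baseline prep, that still reaches
    the same number of mapped reads within total_minutes.

    read_rate_ratio is raw reads per minute relative to the baseline.
    """
    seq_minutes = total_minutes - prep_minutes
    raw_read_ratio = (seq_minutes / baseline_seq_minutes) * read_rate_ratio
    return 1.0 / raw_read_ratio

TOTAL = 90 + 60   # 2D prep plus 2D sequencing, in minutes

same_speed = tolerable_mapping_ratio(TOTAL, 20, 60, 1.0)     # 1D reads no faster than 2D
double_speed = tolerable_mapping_ratio(TOTAL, 20, 60, 2.0)   # assumption: 1D reads twice as fast

print(f"same read rate: 1D mapping rate can fall to {same_speed:.0%} of 2D")
print(f"double read rate: 1D mapping rate can fall to {double_speed:.0%} of 2D")
```

Under these assumptions the extra sequencing time alone lets the mapping rate drop below half of the 2D rate, and the shorter pore occupancy roughly halves the tolerable rate again.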

Possible fits of other Oxford platform facets can also be envisioned for this use case.  Contamination is always a concern; one approach to combat this is to barcode every library, even if running only one sample per flowcell -- reads with the wrong barcode can be ignored.  Oxford reports 99% accuracy in decoding the barcodes, so this might work well.  With an expansion from 24 to 96 barcodes on the way, this sort of barcode rolling becomes more powerful -- the time between reusing barcodes is greatly lengthened.  Since barcodes will soon be part of the ligation process, no further time or effort is introduced (the current barcode scheme uses PCR to add the codes).  I don't remember any discussion of barcoding with the transposase prep, but there is no obvious technical challenge there.
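
A minimal Python sketch of barcode rolling follows. The function names and the read records (dictionaries with a "barcode" field) are hypothetical, made up for illustration; they are not part of any Oxford software.

```python
def barcode_for_run(run_index, n_barcodes):
    """Rolling assignment: consecutive runs use consecutive barcodes,
    so a barcode is reused only every n_barcodes runs."""
    return run_index % n_barcodes

def keep_expected(reads, expected_barcode):
    """Drop reads whose decoded barcode differs from the one used for this run."""
    return [r for r in reads if r["barcode"] == expected_barcode]

# Hypothetical decoded reads from run number 25 on a 24-barcode kit.
expected = barcode_for_run(25, 24)
reads = [
    {"id": "r1", "barcode": 1},   # matches this run
    {"id": "r2", "barcode": 0},   # carry-over from the previous run
    {"id": "r3", "barcode": 1},
]
clean = keep_expected(reads, expected)
print([r["id"] for r in clean])
```

Moving from 24 to 96 barcodes only changes `n_barcodes`, which stretches the gap between reuses of the same barcode from 24 runs to 96.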

Now, this assay is targeting a certain number of well-mapped reads.  What if our estimate is off?  With conventional sequencing, an extra fudge factor is the obvious way to deal with variance in the yield of mappable reads; just overshoot a bit.  But with nanopore, this presents a run-until situation: keep reading from the flowcell until the correct number of mapped reads (from the correct barcode, if using that approach) has been achieved.  The only catch here is having a nanopore read mapper which can keep up with the data flow from the instrument for something as large as a human genome.  It didn't sound like anyone claims such, so even in run-until mode nanopore right now might need to rely more on estimating the number of reads, although one could imagine mapping a sample of reads in real time and using that to project the total number of mapped reads.
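
Both ideas, stopping once the mapped-read target is reached and projecting the total from a mapped sample, can be sketched in Python. The read stream here is a simulated sequence of mapped/unmapped flags, and all function names are hypothetical; nothing below talks to a real instrument or mapper.

```python
import itertools
import math

def run_until(read_stream, is_mapped, target_mapped):
    """Consume reads until target_mapped well-mapped reads have been seen.

    Returns (raw_reads_consumed, mapped_reads).
    """
    raw = mapped = 0
    for read in read_stream:
        raw += 1
        if is_mapped(read):
            mapped += 1
            if mapped >= target_mapped:
                break
    return raw, mapped

def project_raw_reads(sample_flags, target_mapped):
    """Estimate raw reads needed from a small sample of mapped/unmapped flags."""
    mapping_rate = sum(sample_flags) / len(sample_flags)
    return math.ceil(target_mapped / mapping_rate)

# Simulated stream: three of every four reads map (illustrative only).
stream = itertools.cycle([True, True, True, False])
raw, mapped = run_until(stream, lambda flag: flag, 15_000)

# Projection from the first 100 reads when the mapper cannot keep up.
projected = project_raw_reads([True, True, True, False] * 25, 15_000)

print(f"stopped after {raw} raw reads with {mapped} mapped")
print(f"projected raw reads needed: {projected}")
```

The first function needs a mapper that keeps pace with the instrument; the second only needs to map a sample, at the cost of trusting that the sample's mapping rate holds for the rest of the run.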

Read-until could also be used, again with the caveat that there doesn't appear to be a mapper which can call human genome mappings fast enough, particularly for these short fragments.  Matt Loose was using the first 250-ish bases to make calls for his read-until pipeline, which is a quarter of the way through a 2D fragment of 500 bases (ejecting fragments, which must be done before the hairpin is hit, is much faster than sequencing them but not instantaneous).  However, with rolling barcodes for contamination protection, read-until could be useful for rejecting contaminant fragments.

The pre-implantation testing scenario is particularly good for this sort of armchair analysis, given that the target genome size is roughly fixed.  However, the same sorts of thoughts apply to other mapping workflows, such as SV detection or Oxford's WIMP (What's In My Pot) metagenomics analysis.  Assay developers are faced with an array of parameters which can be adjusted to fit their application: fragment size, 1D vs 2D, read-until, run-until & speed of computational analysis.   Given the energy expressed in New York, I think there is a growing cadre of scientists who are excited by these additional controls and eager to try them out. That's why any statement of where Oxford isn't going soon, such as one by a stock analyst saying the NYC agenda was light on cancer genomics so Oxford won't be a presence there anytime soon, is quite foolish. 

Of course the other extant platforms can try to work in this space, but it's much harder.  First, the capacity of just about any flowcell is much, much greater than the few tens of thousands of reads needed for the pre-implantation aneuploidy testing.  Since the instrument is expensive (and therefore instrument time expensive), it will be hard to resist a pressure to multiplex samples, with the subsequent need to batch runs.  Careful attention to flow and scanning time can result in remarkable gains, notably with MiSeq and the Rapid Run mode on HiSeq 2500, but inherently sequencing-by-synthesis is a batch operation.  Furthermore, sample prep on the polony-class instruments (e.g. Illumina, Ion, BGI/Complete, QIAGEN) requires some sort of amplification, which strings out the library prep time -- the idea of a 20 minute library prep on any of the other platforms strains credulity.  

Fast runs might be possible -- a very hasty back-of-envelope calculation suggests that PacBio could read 1Kb fragments in about 20 minutes (if I remember correctly, 1 hour movies are good for 5Kb reads).  Perhaps in a high value assay some sort of "put it on the instrument and take it off soon after" workflow could succeed there - though there is some sort of time required to settle the machine between flowcells -- PacBio was not intended for this application!   Even if one was crazy enough to do it, the capital cost would require centralizing the operation; even placing 1 such lab in every significant U.S. city would be a sizable expenditure.

But in general, the other sequencing vendors are excited to suggest they can generate results in a day or even a morning; under a half hour just isn't going to be practical.  But on nanopore a year from now, that might be the speed of an aneuploidy assay!

Couple this with the economics of MinION, which obviates the economic pressure to centralize operations to achieve required instrument utilization and economies of scale from multiplexing, and you really could have a revolution.  Truly getting a sequencer in every hospital pathology department becomes economically feasible.  Pushing this out further to large clinics/practices would require taking the skill & lore out of library preparation (e.g. will require a full-blown sample-to-sequencer workflow on VolTRAX).  But remember: every time Oxford announces an improvement in throughput, that translates nearly directly into an improvement in time-to-result.  

1 comment:

Anonymous said...

Nice summary.

On a side note, 500bp reads is a lot for aneuploidy screening. For cell free fetal DNA aneuploidy screening (non-invasive prenatal testing NIPT), the pure shotgun MPS approaches usually only use 36bp, as far as I know. That works nicely and I assume the companies offering this have optimized it to the most efficient "sweet spot". And of course, in the setting of cfDNA, fragments are only around 150bp anyway.

cffDNA/NIPT might actually be an interesting field for nanopore sequencing.
30s for a 500bp fragment, let's say roughly 10s for a 150bp fragment of cfDNA. So for 5-10 million reads, you would need 50-100 million seconds of pore time (pore-seconds seems like a useful unit in nanopore sequencing....). So you know how many pores you need in order to be ready in a day, or an hour..... If you have an amplification free and rapid library prep, that might one day be very attractive.
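
The pore-seconds arithmetic can be written out as a short Python sketch. It uses only the figures in this comment (5-10 million reads at roughly 10 seconds each) and assumes, optimistically, that every pore is busy the whole time with no idle gap between fragments.

```python
def pores_needed(n_reads, seconds_per_read, wall_clock_seconds):
    """Pores required to finish n_reads within wall_clock_seconds,
    assuming every pore reads continuously (no idle time)."""
    pore_seconds = n_reads * seconds_per_read
    # Integer ceiling division: round up to a whole pore.
    return -(-pore_seconds // wall_clock_seconds)

DAY = 24 * 60 * 60
HOUR = 60 * 60

for n_reads in (5_000_000, 10_000_000):
    per_day = pores_needed(n_reads, 10, DAY)
    per_hour = pores_needed(n_reads, 10, HOUR)
    print(f"{n_reads:,} reads: {per_day:,} pores for a day, {per_hour:,} pores for an hour")
```

On these numbers a one-day turnaround needs on the order of several hundred to a thousand busy pores, and a one-hour turnaround needs tens of thousands.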

But anyway, you nicely pointed out the disadvantage of the current lineup of NGS sequencers. They get bigger and bigger and cost per base goes down with it, but they miss out on a lot of lower throughput needs. Nanopore sequencing might really be a leap forward, especially if sample/library prep will be so easy.