Omics! Omics!: ONT Sketches Paths to Long, Selective, Accurate Sequencing

Tuesday, June 29, 2021

ONT Sketches Paths to Long, Selective, Accurate Sequencing

Some sort of summary of London Calling in this space is grossly overdue after getting caught by multiple work firedrills and then several recursive rounds of procrastination. I'm not going to attempt to cover all the company announcements. I'm going to focus on a cluster of announcements that show a long range vision of inexpensive sequencing consisting of very accurate, very long reads. Well, a cluster of visions -- some parts can be mixed and matched and others cannot. This should be a prospect to grab the attention of any current or aspiring ONT competitors. Now before I'm accused of being a gullible shill for Oxford, I want to make it clear I think that running the table on these will be technically difficult and is many years in the future. But even if Oxford manages some of these but not all, they would substantially upgrade their platform.

PromethION 2

The most mundane announcement I'll include here is the launch of the PromethION 2, with only 2 flowcell slots and a much lower commitment to purchasing consumables. So a lab wanting to sequence large genomes but not in vast numbers would be the market for this box, which would include the necessary GPU to keep up. At only $60K this should be an easy device to launch. A second version of P2 is basically a hub intended to plug into a GridION, providing an easy path to PromethION for existing GridION users. Pricing on that model seemed to be absent.

Basecalling & Pair Decoding

Next on the radar are the Bonito models now available in MinKNOW, which improve on current error rates. These include a high-accuracy model and a super-accuracy model; the latter requires so much GPU that the GridION can't keep up. So do you want reads hot or more accurate: your call. ONT has clearly built great expertise in training models for basecalling; it is a core strength.

One of those models goes with some slightly tweaked chemistry -- a slower motor -- to give the Q20 chemistry, which was described briefly around the time of the Community Meeting and has been getting rave reviews from a limited set of alpha testers. I haven't had a chance to look at any datasets -- I think there are some microbial ones plus an ultralong Q20 Cliveome set (genome of CTO Clive Brown). I haven't seen a really detailed look at the error profile of what is left behind, in particular how random it is. ONT remains reluctant to switch from their current practice of linear-scaled accuracy to the far more revealing log-scaled error rates which PacBio shows for their data -- some plots were log scaled, but most were linear. It would also be interesting to look at some large datasets with reasonable truth behind them to look for systematic errors and to check the performance on long homopolymers or simple repeats.
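To make the linear-versus-log point concrete, here is a minimal Python sketch of the standard Phred conversion, Q = -10 * log10(error rate). The function names are mine, and the accuracies in the loop are illustrative values rather than ONT or PacBio figures.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (0..1) to a Phred-scaled quality score."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("an accuracy of 1.0 has no finite Phred score")
    return -10.0 * math.log10(error)


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score back to a per-base error rate."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Illustrative accuracies only; on a linear axis these look nearly identical.
    for acc in (0.95, 0.99, 0.995, 0.999):
        print(f"accuracy {acc:.1%} -> Q{accuracy_to_phred(acc):.1f}")
```

On a linear accuracy axis, 99% and 99.9% sit less than a percentage point apart, yet the second has ten times fewer errors; the log scale is what makes that difference visible.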

The Bonito models also handle pair decoding -- if one has signal for two complementary or identical segments they can be decoded together. Clive described how they are trying to increase the number of natural duplex (aka "follow-on") reads -- cases in which sequencing of one strand is immediately followed in the same pore by the complementary strand. This has been seen at low levels in the past -- it led to the short-lived 1D squared chemistry -- but ONT now thinks that by tweaking library construction (to maximize molecules adapted on both ends) and library concentration on the flowcell (low, to minimize interlopers jumping in instead of the second strand) they can get this phenomenon into tens of percent of the reads. A new E8.1 motor performs better on the complement strand when performing pair decoding, though Brown didn't describe how there is a "memory" within the pore or motor that it is working the second strand.

One route to improving pair decoding that will have broader impact is trans tethers -- the molecule tethering library molecules to the membrane is attached to the membrane on the trans side and pokes back through, with a way of docking to library molecules. This eliminates the tether sticking library molecules to every unproductive surface within the flowcell. An overall boost in sensitivity is the immediate benefit -- ONT reported generating 5 gigabases of data from only one nanogram of input DNA.

Topoisomerase Libraries

Speaking of library construction, ONT described a third chemistry beyond the ligation and transposon technologies. A site-specific topoisomerase from vaccinia virus can be loaded up with adapters bearing the enzyme's recognition sequence, and it then glues them onto A-tailed ends. The process is expected to have a simplicity similar to the transposase prep -- hope for lab incompetents like moi -- but importantly won't fragment molecules as the transposase does. Furthermore, it won't generate adapter dimers -- so presumably the adapter concentrations can be ramped up to drive adapter addition at both ends of library molecules. It is unclear when this chemistry will be made available.

Voltage Sensing

On the chip front, progress continues on voltage sensing (rather than current sensing), which allows much higher densities. I still think they have a long way to go -- it's just getting to the breadboard stage. Given the amount of data to pull off in real time, the need to train new models to hit the accuracy of the old ones, and the need to ensure the chips can be manufactured, this one still feels a few years out, though ONT didn't really give guidance. But should they succeed, the data density from each of their flowcell formats might go up several fold, with a potential similar reduction in cost per base. Or one could see this as a way to ameliorate issues with low-yield processes -- if only 25% of your reads are duplex but you are getting 4X as many for the same flowcell cost, that's a wash on total yield but the accuracy would be much higher.
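The back-of-envelope arithmetic in that last sentence can be written as a tiny Python sketch. The function name is mine, and the 25% and 4X inputs are just the hypothetical numbers from the paragraph above, not ONT figures.

```python
def relative_duplex_yield(duplex_fraction: float, density_gain: float) -> float:
    """Usable duplex output relative to one current flowcell's total output.

    duplex_fraction: share of reads that are duplex (0..1).
    density_gain: fold increase in reads for the same flowcell cost.
    """
    if not 0.0 <= duplex_fraction <= 1.0:
        raise ValueError("duplex_fraction must be between 0 and 1")
    if density_gain <= 0.0:
        raise ValueError("density_gain must be positive")
    return duplex_fraction * density_gain


if __name__ == "__main__":
    # 25% duplex reads on a chip yielding 4X as many reads: break-even on yield.
    print(relative_duplex_yield(0.25, 4.0))
```

Anything above 1.0 would mean more duplex data than today's flowcell delivers in total simplex data, at the same flowcell cost.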

The Innies and Outies of ONT

Okay, the next part is involved and also has some ONT jargon I've not yet adapted to. But I also can't figure out a better replacement and besides it's their zoo so they get to name the residents.

Clive pointed out that there are a number of ways to perform strand sequencing. For example, DNA sequencing on nanopore is currently 5'->3' but direct RNA runs 3'->5'. Both use an approach that Clive calls "innie" -- the helicase motor is on the front of the adapted molecule, docks to the pore and then drives a strand through the pore with that strand exiting the pore after the end is reached.

He then laid out a different scheme of "outie" sequencing, in which the helicase docks but then ratchets backwards on the DNA to separate it, then drives a strand through the pore quickly without sequencing it. I call that the downstroke; I don't believe ONT did. But the DNA can't escape when the end is reached, because the motor protein acts like a stopper. This causes the helicase to stall, the strand can slip back through the pore, and the strand is sequenced on the upstroke once the helicase restarts -- but again, when the end is reached it doesn't exit. This cycle can repeat for as many iterations as desired, apparently resequencing the last few to perhaps ten kilobases many times. At any time, a voltage can be used to eject the whole complex upwards, much like the current selective sequencing scheme.

Now you have lots of reads of the same molecule to use for decoding -- or at least one end of it. Importantly, as pointed out by Clive, there's no theoretical length limit, nor any dependence of the decoded sequence's quality on length. This is in contrast to PacBio's HiFi scheme that favors shorter molecules (short being relative, with some libraries yielding 25Kb HiFi). ONT also currently has a higher baseline accuracy, so fewer passes should be required to hit a given quality score. If the decoding can keep up with the sequencing, then one could have a quality-driven number of passes -- keep iterating until the sequence quality is above a pre-determined cutoff.
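As a rough illustration of why a better single-pass accuracy means fewer passes, here is a toy Python model. It assumes errors are independent between passes and that the consensus is a simple majority vote that fails whenever more than half the passes are wrong at a base. Both are my simplifying assumptions, not how ONT's or PacBio's decoders work, and the error rates in the usage lines are illustrative only.

```python
import math
from typing import Optional


def consensus_error(per_pass_error: float, passes: int) -> float:
    """Toy majority-vote error: chance that more than half the passes are wrong."""
    needed = passes // 2 + 1  # wrong calls needed to outvote the right ones
    return sum(
        math.comb(passes, k)
        * per_pass_error**k
        * (1.0 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


def phred(error: float) -> float:
    """Phred-scaled quality for an error probability."""
    return math.inf if error <= 0.0 else -10.0 * math.log10(error)


def passes_to_reach(per_pass_error: float, target_q: float,
                    max_passes: int = 51) -> Optional[int]:
    """Smallest odd number of passes whose toy consensus meets target_q."""
    for passes in range(1, max_passes + 1, 2):  # odd counts avoid tied votes
        if phred(consensus_error(per_pass_error, passes)) >= target_q:
            return passes
    return None  # cutoff not reachable within max_passes


if __name__ == "__main__":
    # Illustrative per-pass error rates, not vendor specifications.
    for err in (0.01, 0.05):
        print(f"per-pass error {err:.0%}: "
              f"{passes_to_reach(err, 30.0)} passes to reach Q30")
```

A real quality-driven scheme would stop on the decoder's own running quality estimate rather than on a model like this; the sketch only shows the shape of the stopping rule and how strongly the starting accuracy drives the pass count.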

Also interesting is that while the downstroke doesn't generate sequence, the time required for it is proportional to the sequence length. So there is a potential to add length to a selective sequencing scheme -- run multiple passes only on sequences within certain regions of the genome (or certain novelty criteria) and which are above a length cutoff. Clive proposed this for gap filling of genomes, a task that is much rarer than it once was but still annoyingly important. It would seem that there is a need for an entire programming language devoted to current and future selective sequencing so that the selection criteria could change over the course of a run and include all the various criteria (sequence, length, consensus quality) that might be applied.
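To give a feel for what such combined selection criteria might look like, here is a hypothetical Python sketch. Everything in it is my invention for illustration: the class and field names, the three-way decision, and especially the constant `bases_per_second` used to turn downstroke time into a length estimate. None of it reflects an actual ONT API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class MoleculeState:
    downstroke_seconds: float   # time the unsequenced downstroke took
    passes: int                 # upstroke passes completed so far
    region: Optional[str]       # mapped region label, None before any pass
    consensus_q: float          # current consensus quality (Phred scale)


@dataclass
class SelectionRule:
    target_regions: FrozenSet[str]
    min_length_bases: float
    target_q: float
    max_passes: int
    bases_per_second: float     # assumed constant downstroke speed (hypothetical)

    def estimated_length(self, state: MoleculeState) -> float:
        """Length estimate from downstroke time, assuming a constant speed."""
        return state.downstroke_seconds * self.bases_per_second

    def decide(self, state: MoleculeState) -> str:
        """Return 'eject', 'continue' or 'finish' for the molecule in the pore."""
        if self.estimated_length(state) < self.min_length_bases:
            return "eject"      # too short, known before any sequencing
        if state.passes == 0:
            return "continue"   # no sequence yet, so take a first pass
        if state.region not in self.target_regions:
            return "eject"      # maps outside the regions of interest
        if state.consensus_q >= self.target_q or state.passes >= self.max_passes:
            return "finish"     # good enough, or pass budget is spent
        return "continue"


if __name__ == "__main__":
    rule = SelectionRule(frozenset({"gap_7"}), 50_000, 30.0, 20, 1_000.0)
    print(rule.decide(MoleculeState(60.0, 3, "gap_7", 25.0)))
```

Because the rule is plain data, it could be swapped out partway through a run, which is the kind of flexibility a dedicated selection language would need to offer.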

Now this "outie" business is clearly a great leap -- it will require new models, new enzymology and so forth, and Clive wasn't revealing where they are in the process. But it plays to ONT's strengths -- basecaller model building, pore and motor engineering, fine-scale voltage bias control. So again, this looks many years out -- but it should really terrify competitors (and if you are in the C-suite, your job is to be paranoid about the competition) because it could offer reads hundreds of kilobases long.

What might bog it down? Well, there are all the technical challenges above. There would also be the potential problem of really nasty DNA knots tying up pores. As someone online commented, it is now apparent why Clive has been lukewarm on putting nucleases on the trans side of the pore to destroy DNA before it can knot, but this new scheme could really be preyed on by secondary structures.

Clive also described a way to perform size selective sequencing with existing pore chemistry by using a slightly different library adaptation scheme. A hairpin adapter is ligated to one end of a fragment with the motor protein on the hairpin itself. The motor will then unzip the double helix and feed one strand through the pore but too fast to sequence. Again, it is the motor bumping into the pore which starts the sequencing cycle -- and the time spent unzipping that first strand can be measured and used to estimate the length of the fragment while the second strand (after the hairpin) is what is sequenced.

This seems to me like a clever demonstration but not a great product. Product line complexity has been ONT's bane, confusing customers and challenging their supply chains. A specialized library prep kit is required, with some tuning of conditions -- double-adapted molecules are very undesirable here. Brown showed some interesting size enrichment, but how many labs are really going to buy such a product? Plus you give up on duplex sequencing.

Summation

So ONT has a number of different initiatives for boosting their platform. As I've attempted to note, some are complementary whereas others are purely divergent. The voltage sensing could really give unbelievable productivity but is the farthest out. Switching to "outie" seems like a big move, potentially giving very high accuracy for a few to several kilobases and also the size selective sequencing. And there is more, including a bunch of developments I'll skip so this piece doesn't get permanently stuck in draft mode -- though I may revisit some other bits in the future.

That's all exciting, but having too many pans going at once has often bedeviled the ONT kitchen. Without great care in messaging, training the sales staff and educating the user base, this can make the already complex ONT lineup even more confusing. Managing a complex supply chain and actually delivering the correct product on time has historically been a challenge for ONT.

ONT has often tried to position themselves as only going after Illumina, with PacBio being barely worthy of mention. Clearly that wasn't the case this time -- the various quality improvements are aimed at HiFi and Brown was very direct in pointing out that many of the ONT approaches won't have the length constraints of HiFi. He also kept making a point about whether he had said "performant" or "performance", exaggerating his enunciation at times -- with a hint that these two words had been tangled over in the presence of lawyers.

Certainly Illumina is still very much in ONT's sights: the fact that some of the advanced basecallers frequently hit Q30 -- the quality cutoff by which ILMN reads are judged -- was emphasized at one point. Brown also claimed that, with the higher quality, many variant callers developed for short reads perform well on ONT data.

5 comments:

Anonymous said...: A typical promethION flowcell sequences a human genome to ~30X, but the Q20 cliveome consists of eight promethION flowcells. This implies that the Q20 chemistry leads to lower yield. Perhaps that is why Nanopore limits the Q20 chemistry to early accessors. The thing with some Nanopore announcements is that there are definitive improvements in one direction but there are also hidden caveats in another direction.; Wednesday, June 30, 2021 9:11:00 AM
Anonymous said...: I guess users will find out for themselves when they get the Q20

My understanding was that for internal runs they use rejected sub-par flowcells with fewer pores and they don't necessarily run them to completion; after all, they make them, so it's not costing them. I don't think the point of that Cliveome thing is to demo throughput, after all.; Thursday, July 01, 2021 10:45:00 AM
Anonymous said...: surely the masters of improving in one dimension and reducing in another are PacB. HiFi reduces throughput by a factor of ~30X and increases cost the same, it even reduces the read lengths.; Thursday, July 01, 2021 11:01:00 AM
Anonymous said...: They badly need to put more attention on direct RNA. It is like they don't know that nobody else can do it...
So many pipedreams being worked on instead while the potential golden goose is left as a lame duck; Thursday, July 01, 2021 5:44:00 PM
Anonymous said...: Hi, thanks for the useful blog.
Will you be able to comment further following the recent tech announcements from ONT?
Thanks in advance.
p; Tuesday, December 07, 2021 9:12:00 AM
