Thursday, February 20, 2025

Roche Xpounds on New Sequencing Technology

Bar bets can be a powerful force in human society.  One of the best known books on the planet, The Guinness Book of World Records, originated from the need to equitably settle wagers.  Many entries in that tome are questions of immense scale - the largest this or heaviest that.  Shortly before this posted, Roche unveiled a sequencing technology that, per its inventors, may be the result of such a bar bet: how large a dangling bit can you stick on a nucleotide and still have it incorporated by a polymerase?


A New Nanopore Technology

Roche's Sequencing By Expansion (SBX) technology is a continuous-sensing, nanopore-based strand sequencing approach - except rather than native strands it reads what Roche calls “surrogate polymers”, radically modified versions of the original which retain the same sequence information.  To understand why the odd nucleotides are needed, it’s useful to look at the best known nanopore sequencing platform: Oxford Nanopore.  Since ONT has had the only nanopore platform on the market, it's been easy to slip into a lazy habit of using "nanopore sequencing" to mean both the generic concept and the specific implementation by ONT.  Roche now forces some greater discipline.  While some aspects are similar, there are important differences both in terms of mechanics and also output.


A key challenge which ONT has faced is that a biological pore is much longer than a single nucleotide.  As a result, the signal from DNA passing through a pore arises from many bases interacting with the pore.  So instead of mapping four signal states to four bases, there is the harder problem of mapping on the order of four-to-the-kth-power signal states to k-mers.  An added twist is that there aren't actually only four bases, but rather all the different modifications - something which ONT has leveraged for new capabilities and insights, but it definitely muddies the basecalling waters.  Plus, the bases are those that nature gives us, and some may be difficult to distinguish because they interact in only subtly different ways with the pore.  In addition, the nucleotide strands may form secondary or higher order structures on either side of the pore which may complicate controlled translocation through the pore.
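To put rough numbers on that k-mer problem (my own illustration, not anything from ONT or Roche): if the signal at any instant reflects the k bases sitting in the pore, the basecaller must discriminate on the order of 4^k canonical states rather than four, before modified bases even enter the picture.

```python
# Rough illustration: how the signal-state space grows when a pore
# "sees" k bases at once (canonical bases only, no modifications).
for k in range(1, 7):
    print(f"{k}-mer in the pore -> {4 ** k:>5} canonical signal states")
# 1 -> 4 ... 5 -> 1024, 6 -> 4096
```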


SBX attacks the basecalling problem by using “bases” which attempt to erase all these issues.  If the replacements are the length of a pore and can be stepped through metronomically, then much of the complexity of basecalling is eliminated - and what is left can be fixed by tailoring the replacement so that each base creates a maximally distinctive signal.  This comes at the obvious cost of erasing any modified base information; everything is compressed to a 4-letter alphabet.  It also means resynthesizing the library molecules.


This Chemistry is Huge!


But to do all this requires a Brobdingnagian Frankenstein of a base - the Xpandomer nucleotides are twenty kilodaltons apiece!  A gigantic scissor-like structure protrudes and a cleavable bond is introduced into the phosphate backbone.  On treatment with acid, the bond is cleaved and the scissor unfolds, creating an extremely elongated structure which also bears a region which provides an unambiguous signal for each base.


When I covered SBX previously, at the time Roche acquired Stratos Genomics, a question was what stumbling blocks remained.  A key breakthrough for the SBX technology was improving the processivity of the replacement process - originally the polymerase would not proceed more than tens of bases before giving up.  Protein engineering has been used to improve matters, but a crucial piece was to add yet another bit of functionality to the expansion nucleotide - a moiety which enhances processivity during SBX copying.  With that, library molecules over a kilobase can be translocated - stomping on the skepticism of an early advisor who thought such gigantic molecules could never be pushed through a nanopore.  Yet another functional element, occupying the top of the molecule, enables precise translocation control.




The polymerase for the copying step is a highly modified Y-family translesion polymerase, with greater than 10% of its residues mutated.  Yow!  So mutated that Roche says it really doesn't work well on standard nucleotides.  Many of these changes are near the Xp strand.  Roche seems to have thrown the kitchen sink of protein engineering approaches at this - surface display, directed evolution, machine learning, molecular dynamics simulations.  The end result is a strand-displacing polymerase with a base incorporation error rate of 0.7% and processivity sustained over a kilobase - and the starting polymerase had low processivity and was not strand displacing.  Sequencing coverage is remarkably uniform over most of the percent GC range - basically flat from 20 to 80 percent.
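As a quick back-of-envelope on that 0.7% figure (my arithmetic, not a Roche number), the per-base error rate corresponds to roughly Q21 on the Phred scale, which squares with the “Q20+” raw simplex quality discussed below.

```python
import math

err = 0.007                                                # quoted per-base incorporation error rate
print(f"Phred Q = {-10 * math.log10(err):.1f}")            # ~21.5, i.e. "Q20+" territory
print(f"Expected errors per 1 kb read: {1000 * err:.0f}")  # ~7
```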




For sensing, SBX uses Genia’s dense arrays of protein pores on a reusable 8 million sensor CMOS array.  Roche has achieved far better than Poisson loading of the sensors with pores, often achieving over 90% of sensors generating useful data.  A chemical element on the primer oligo drives fast association of library molecules with pores, so pores sit idle for only short intervals.  Translocation is carefully controlled by 1.5-2.0 millisecond electrical pulses; this metronomic translocation plus the SBX tailoring gives signal traces that look decodable by eye - four distinct signal states at constant intervals.  The signal is a bit more complex than that - there is some drift in signal - but it's certainly not the arcane squiggles seen in ONT traces.
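To illustrate why such metronomic traces are attractive, here is a minimal toy decoder - entirely my own sketch, not Roche's signal processing, with made-up levels and sampling rates.  If every base occupies a fixed-width window and produces one of four well-separated levels, calling is little more than averaging each window and picking the nearest level.

```python
import numpy as np

# Toy decoder for a metronomic four-level trace - loosely inspired by the SBX
# traces, but every number here is hypothetical.
LEVELS = {0.1: "A", 0.4: "C", 0.7: "G", 1.0: "T"}   # hypothetical current levels
SAMPLES_PER_BASE = 20                                # hypothetical samples per pulse

def call_trace(trace):
    calls = []
    for i in range(0, len(trace) - SAMPLES_PER_BASE + 1, SAMPLES_PER_BASE):
        window_mean = float(np.mean(trace[i:i + SAMPLES_PER_BASE]))
        nearest = min(LEVELS, key=lambda ref: abs(ref - window_mean))
        calls.append(LEVELS[nearest])
    return "".join(calls)

# Simulate a short noisy trace for the sequence ACGT and call it
rng = np.random.default_rng(0)
trace = np.concatenate([lvl + rng.normal(0, 0.03, SAMPLES_PER_BASE)
                        for lvl in (0.1, 0.4, 0.7, 1.0)])
print(call_trace(trace))   # -> ACGT
```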




But What About the Data?


The end result is a gusher of data at usable quality.  Roche is showing off a run which generated seven human 30X genome datasets with only 1 hour of data collection time!  Even adding in the 2 hours of copying and expansion chemistry upstream of that, this is crazy fast data generation - sample to analyzed duplex data in under seven hours.  Raw simplex data quality is “Q20+”, similar to older Illumina chemistries but less than modern XLEAP Illumina chemistry or Element data, and far less than high accuracy short read sequencing such as PacBio Onso, Element Q50 or Ultima’s ppmSeq.  For counting applications, that is more than good enough.
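Some quick arithmetic on that seven-genomes-in-an-hour claim (my numbers, assuming a ~3.1 Gb human genome): it works out to roughly 650 Gb of yield, or an average of around 180 megabases of usable data per second of collection.

```python
genomes, depth, genome_gb = 7, 30, 3.1       # ~3.1 Gb human genome assumed
yield_gb = genomes * depth * genome_gb
print(f"Total yield: {yield_gb:.0f} Gb")                       # ~651 Gb
print(f"Average over 1 h: {yield_gb * 1000 / 3600:.0f} Mb/s")  # ~181 Mb/s
```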


For applications demanding higher accuracy, Roche is offering a duplex library kit that uses a simple Y adapter to link the two strands of a fragment together and generate data in the high Q30s.  When calculating duplexes, any non-duplex regions (for example, due to one strand having a nick) are assigned a lower quality score but kept for mapping purposes.  Roche has been collaborating with the omnipresent DeepVariant team at Google to ensure that there is a custom model for SBX technology.  Duplex inserts of 350 basepairs show no degradation in accuracy.  Homopolymer accuracy for duplex is said to be >99% F1 for under 15bp, with accuracy falling off after that depending on which basecaller is used.
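As a cartoon of what the duplex calculation involves (a minimal sketch under my own assumptions, not Roche's actual algorithm): where the two strand reads agree, the base gets a high consensus quality; where only one strand covers a position or the strands disagree, the call is kept but down-weighted.

```python
# Toy duplex consensus: compare the top-strand read with the bottom-strand read
# (already reverse-complemented) position by position.  Quality values are
# arbitrary placeholders, not Roche's.
HIGH_Q, LOW_Q = 40, 15

def duplex_consensus(top: str, bottom_rc: str):
    consensus, quals = [], []
    for t, b in zip(top, bottom_rc):
        if t == b:                      # both strands agree -> high quality
            consensus.append(t)
            quals.append(HIGH_Q)
        else:                           # missing ("-", e.g. a nick) or discordant
            consensus.append(t if b == "-" else "N")
            quals.append(LOW_Q)         # kept for mapping, but down-weighted
    return "".join(consensus), quals

print(duplex_consensus("ACGTAC", "ACGTAC"))   # perfect duplex
print(duplex_consensus("ACGTAC", "ACG-AC"))   # one strand gapped at position 4
```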




Duplex SBX sequencing can be both fast and accurate.  Data presented by Broad Clinical Labs uses the SBX-Fast (no PCR, duplex) library prep and only 20 minutes of data collection to generate 27X median coverage of HG001 with a Q-score of 39, for a library-prep-to-VCF time of 6 hours, 25 minutes!



Simplex read length of SBX technology is dependent on fragment length and of course processivity.  While Roche isn’t claiming reads in the truly long lengths of PacBio or ONT, they show the potential to generate hundreds of millions of kilobase reads and tens of millions of reads in the 1200 basepair range (and slide captions talk about “1500+” basepairs) - what I’m going to dub “midi read sequencing”.  That may be interesting for some applications, a topic I may expand on soon.  For example, even if not full length, mRNA reads in that range can elucidate much more complex splicing patterns than typical 2x150 reads.
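A crude way to see why (my own back-of-envelope, assuming a typical ~150 bp internal exon): a 1200 bp cDNA read can span on the order of seven exon-exon junctions, while a 150 bp read rarely catches more than one.

```python
# Back-of-envelope: exon-exon junctions a single cDNA read can span,
# assuming a (hypothetical) typical internal exon of ~150 bp.
exon_bp = 150
for read_bp in (150, 300, 1200):
    junctions = max(0, round(read_bp / exon_bp) - 1)
    print(f"{read_bp:>5} bp read -> ~{junctions} junction(s) spanned")
```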


Errors are independent of position in the read or of library fragment length, as one would expect for a single molecule, non-clonal sequencing method.  Kokoris says that errors are more-or-less evenly split between errors in Xpandomer synthesis and errors in data collection. Unexpanded nucleotides or other pore blockages can be backed out. Quality scores will be binned to three numerical levels, corresponding to high, medium and low quality. Homopolymers appear to be a major source of error, presumably due to slippage during SBX generation.  It should also be noted that accuracy and performance are independent of library length distribution; clonal short read systems often have lower throughput and accuracy with long insert libraries due to large clusters/polonies and other effects.
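On that three-level quality binning, a trivial sketch of what it might look like - the thresholds and bin values below are my own placeholders; Roche hasn't published theirs.

```python
# Hypothetical Q-score binning into high / medium / low levels.
def bin_quality(q: int) -> int:
    if q >= 30:
        return 37   # "high" bin (placeholder value)
    if q >= 15:
        return 22   # "medium" bin (placeholder value)
    return 8        # "low" bin (placeholder value)

print([bin_quality(q) for q in (40, 33, 25, 17, 10, 3)])
```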




Cost

Roche isn't yet describing pricing in detail or a commercialization date beyond "in 2026", but from what they have disclosed some cost estimates can be made - they did claim that SBX will cause a "new inflection point" in data generation costs. In the Q&A the final question was around cost, and the answer never got beyond a promise to "meet the expectations", while somewhat circuitously suggesting there will be multiple instrument options at mid-throughput and high-throughput price points.


Consumable Cost


A key piece of the expense of sequencing is the flowcell cost, and with a CMOS array the Genia-derived SBX device will have a non-negligible cost-of-goods.  But this component is reusable, perhaps tens of times (Roche hasn't nailed this down), and membrane formation and pore insertion are automated on the sequencing instrument.


Another cost component is those wacky Xpandomer nucleotides.  But these are used in a single-pot reaction to perform the expansion; there's no repeated flowing of expensive chemistry as with a cyclic short read sequencing system.  So the cost impact of the Xpandomers should be very modest.


One thing that further moderates the cost is the largely symmetrical nature of the SBX modification, which enables a relatively simple convergent synthesis. 



Instrument Capital Cost Speculations


The Roche technology divides the workflow between two instruments, a benchtop instrument to run the expansion chemistry and a floor model sequencer.  There's a regular tug-of-war in the industry between keeping these functions separate vs. merging them into a single instrument, but Roche makes the argument that keeping them separate enables better matching of throughput.  In other words, if you are going to be running many short duration library pools through the process, you might want a higher than 1:1 ratio of expansion instruments to sequencing instruments, as the toy calculation below illustrates.
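A toy version of that throughput-matching argument (the run length here is my own hypothetical, not a Roche figure): if expansion chemistry occupies its instrument for about two hours per batch while a short sequencing run takes half an hour, a single expansion box can't keep one sequencer fed.

```python
import math

expansion_hours = 2.0      # quoted expansion chemistry time per batch
run_hours = 0.5            # hypothetical short run, including load/unload

print(f"Expansion boxes per sequencer: {math.ceil(expansion_hours / run_hours)}")  # 4
```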


Neither would be expected to be particularly complex on the wetware side.  Expansion is performed on a solid surface and requires a typical polymerase incubation, treatment with acid to drive the expansion step, and exposure to UV to release the expanded molecules, plus wash steps between these.  That's some really simple fluidics, the sort found in an iSeq or PGM.  So this shouldn't be an expensive box.  The surface carries the primers for the reaction, which bear the chemistry for docking to the pores.


Similarly, the sequencer component shouldn't be very complex on the wet side - there are some liquid flows to manage while generating the membranes and pores.  The major capital expense - and bulk - is the large GPU-based compute server built in to perform data reduction, demultiplexing and other analyses.  SBX traces look like you could call them by eye, just like the textbook examples I had for calling Sanger and Maxam-Gilbert reads.  But with nearly 8 million sensors summing to half a gigabase per second, that isn't a practical approach.



The biggest expense on the sequencer side is assuredly the compute onboard.  While decoding the raw traces looks computationally simple, 500 megabases worth per second is quite a firehose to deal with.  While the current downstream analysis workflow waits for the sequencer to finish, there’s no need for that, as reads can be demultiplexed and sent down an analysis pipeline as they complete on the sensor (just as with ONT).  Supporting such downstream analysis onboard will require a lot of compute which will push the capital cost higher. 
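To size that firehose (again just arithmetic on the quoted figures): 500 megabases per second is on the order of 1.8 terabases of raw calls per hour, or an average of roughly 70 bases per second for each of the ~90% of 8 million sensors generating data.

```python
rate_mb_s = 500                            # quoted aggregate basecall rate, Mb/s
active_sensors = 8_000_000 * 0.9           # ~90% of sensors yielding useful data

print(f"Per hour: {rate_mb_s * 3600 / 1e6:.1f} Tb")                                  # 1.8 Tb
print(f"Average per active sensor: {rate_mb_s * 1e6 / active_sensors:.0f} bases/s")  # ~69
```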


So the sequencer ends up with a great deal of compute hardware, much like most sequencers on the market - Ultima’s UG100, Element’s AVITI24, ONT’s PromethION, Illumina’s NovaSeq X, PacBio’s Revio and so forth.  And like many of them, the compute is possibly the main driver of purchase cost.  But since it lacks the fancy optics of the short read sequencers, the SBX sequencer box should have a much lower capital cost.  Again, absolutely no guidance from Roche at this time, but I’m guessing an SBX installation will come in somewhere in the $0.4M to $0.5M range, so a cost around that of a Revio and not a UG100 or NovaSeq X.  The possible industry ramifications of such a price I’ll tackle in a companion piece.



Furthermore, an obvious marketing-driven strategy would be to offer the instrument in two or more grades of onboard compute, with a less expensive machine that doesn’t attempt the near real time secondary analysis and a full-featured machine able to run a complete pipeline in sync with data generation.  The latter would fit well with SBX’s suitability for variable run lengths, with the run finishing when a particular total amount of data (or perhaps a total for each sample) is hit (Roche is calling this "run until done"); a toy sketch of such logic follows below.
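A minimal sketch of what "run until done" might look like from the software side - entirely hypothetical; Roche hasn't described any interface, and all names and numbers below are made up.

```python
import time

# Hypothetical "run until done" controller: stop data collection once every
# sample in the pool has reached its per-sample yield target.
def run_until_done(targets_gb, poll_yield, poll_seconds=60):
    while True:
        yields = poll_yield()              # current per-sample yields (Gb)
        if all(yields.get(s, 0.0) >= t for s, t in targets_gb.items()):
            return yields                  # every sample done -> end the run
        time.sleep(poll_seconds)

# Simulated instrument adding 5 Gb per sample between polls
state = {"n": 0}
def fake_poll():
    state["n"] += 1
    return {s: 5.0 * state["n"] for s in ("sampleA", "sampleB")}

print(run_until_done({"sampleA": 90, "sampleB": 120}, fake_poll, poll_seconds=0))
```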


Wild!


Roche’s SBX technology is finally in the wild, and the time of completely wild speculation comes to an end.  The ASeq Discord has been a predictable hotbed for that speculation.  Whether to sign the NDA was a subject of some debate, with some choosing to forgo a pre-look at that price and Brian Krueger at Omics.ly cleverly negotiating down the timespan of his.  It’s also interesting how some (moi!) under NDA took it as a call to nearly absolute silence on the topic - my silence on the Discord was seen as a clear indicator I had signed - whereas Alex Dickinson reveled in dropping teasers on LinkedIn.


I signed the NDA greedily, as I'm confident in my ability to keep things quiet - or perhaps not confident in my ability to remember anything dangerous - and the reward was a conversation with Stratos Genomics founder Mark Kokoris and some others on the Roche SBX team to get a preview of the presentation. I'm really glad I did, as otherwise I'd be writing furiously all day today and wouldn't have this out - or as polished - until at least Friday. Note that all images are courtesy of Roche and that they did review the text to avoid any accidental disclosures or silly factual errors on my part.


One of the stranger predictions somewhere on those threads was that Roche has no room to further optimize.  Huh??  There are always multiple aspects of any complex chemistry plus instrumentation plus software technology which haven’t been fully explored.


In the webinar, Mark Kokoris talked about this being a beginning, not an end. He even spoke of further increasing the number of sensors - and therefore pores - beyond the current 8 million.

As an aside, there was long a rumor that Roche would make their system a closed ecosystem only for clinical applications - they've explicitly rejected that rumor and said it will be open, "for everyone to use" and initially Research Use Only (RUO).

On the cost side, I’m sure the SBX nucleotide chemistry is a good process - but a lesson from the pharmaceutical world is that there’s always a better process chemistry than the one you have.


On performance, it is absolutely impractical to explore all of protein engineering space for a problem.  Processivity and fidelity of the SBX expansion polymerase will always be potential targets for further engineering success, and increasingly sophisticated protein design models make it plausible to take great leaps in sequence space that can’t be attained by the small steps of conventional mutagenesis approaches.


For the impatient, spending approximately two hours converting the library to expanded form upstream of an instrument run only a sixth as long is just not going to sit well - clearly an area to see if even small time savings can be achieved.


In the webinar, in response to a question, Kokoris suggested there is room for a fifth signal state to cover epigenetic modifications - though how to perform a conversion that captures 5mC or 5hmC as such a fifth base type is a very interesting challenge.


The Roche duplex chemistry is simple and fast and gives consensus in the high Q30s, but if there are applications that demand even higher accuracy then the literature offers many candidates - UMI-based duplex schemes or rolling circle to generate more copies to compute over.  The latter fits well with the “midi read” length profile.


Finis


With a big splash like this, one can expect ripples with all sorts of interesting interference with each other; I look forward to reading other coverage of the webinar and what I expect will be robust discussion - I'm particularly excited to chat on this topic at AGBT next week.  There’s also a bioRxiv preprint imminent.  I’ll stir that pot a bit myself in the next two posts.  Data dumps apparently won't be coming until this summer; it will be very interesting to see what oddities, artifacts and quirks independent researchers can pull out of the data.


With any new technology unveiling there are many unanswered questions.  But one in particular plays on my mind - which side of the bet did the polymerase take?


6 comments:

Allan Stephan said...

It's wonderful that the world is exposed to Mark Kokoris and the power of his mind.

David Eccles said...

When I look at new technologies, I think about where the technology can be used, and whether the technology adds anything new and interesting. In this case, I get confused.

While 1kb-ish reads will break the [reverse-complement backbone](https://bsky.app/profile/gringene.org/post/3lhxbkjmrsc27) that kills pure Illumina assemblies (and limit contigs to a few megabases), it still won't enable full-chromosome assemblies because the reads are too small to span across large tandem repeat arrays that litter the genome. Where large tandem repeat arrays don't exist, the genome is usually small enough that the reverse-complement backbone is also not present. In the high-quality, medium-length (or midi-read) sequencing space, PacBio sequencing seems to me to be a more appropriate alternative (but *all* technologies suffer from the very-long complex tandem repeats, as seen in centromeres).

With regards to sample preparation time, Illumina's PCR-free Prep claims to be able to deliver libraries in [90 minutes from extracted genomic DNA](https://www.tst-web.illumina.com/content/dam/illumina/gcs/assembled-assets/marketing-literature/illumina-library-prep-brochure-m-gl-00033/library-preparation-solutions-m-gl-00033.pdf), so a 2-hour sample prep time isn't even going to compete with that.

Base-level signal accuracy that is noticeable by eyeballing the data is great, but if that signal represents a synthesised copy, how is it any different from the other sequencing-by-synthesis methods available (or other similar replicative sequencing methods). Such methods force a particular model onto the sequencing process; you only see what the replicative chemistry allows you to see. That typically means a four-base sequencing capability for DNA. If you're lucky, there might be a specific non-standard base modification included as well (e.g. CpG methylation). But RNA is out (except through cDNA, which is itself a loss of information, and potentially introduces further replication errors), protein is out, as are any other disruptive uses.

So... what's their target market? Are they just hoping people will buy their technology because "Ooh, shiny nanopore!"?

Keith Robison said...

David:

Thank you for the thoughtful comments. A few response points

Getting high quality de novo assemblies will still certainly be the realm of the true long read technologies. I'll have more to say on 1kb reads tomorrow

The 2 hours I referred to is more analogous to clustering/polony generation on the short read platforms, but I see the duplex library prep is also about 2 hours. In the end though, the ability of SBX to generate deep short/midi read data quickly is far beyond any other platform - unless you count cranking all the flowcells on a PromethION P24 on the same library

Yes, it's a four base sequencer. Directly detecting base modifications is cool stuff, but still a niche market

Target markets: generating huge amounts of counting data, generating 30X short read genomes for routine purposes (see my other piece for more on this)

Anonymous said...

Enjoy AGBT Keith. It’s been a long time since we first met in that Jacuzzi in 2010 or 2012. Wish I could be there with you.

Anonymous said...

I noticed a slide describing cost of SBX over the years - which seemed to reach a value of about $30 (halfway on a log scale between $10 and $100). I thought it said Per Genome, but they flashed this up pretty quick. Can anyone verify that? If so, even if it's THEIR cost, i.e., not marked-up selling cost and ignoring Ultima's wild claims, that would be very competitive. Thoughts?

David Eccles said...

I was trying to avoid mentioning ONT, but can't help myself now that you've opened the door. I just did a quick envelope calculation for nanopore sequencing running at "hundreds of millions of bases per second" [https://sequencing.roche.com/global/en/article-listing/sequencing-platform-technologies.html], and came up with 74 PromethION-scale flow cells. Which is (as you say) a lot.

I presume that because Roche is using sequences that are so long, they can run them through at faster pore translocation speeds (e.g. tens of thousands of bases per second) without an appreciable loss of accuracy.

I am aware that ONT has previously toyed with the idea of higher throughput (e.g. talking about voltage-sensing arrays in 2021, with MinION-scale devices running 100,000 channels, 3.9 Tb / day from a single flow cell), and that their biggest bottleneck was in managing the data output, not the technology itself. Given that, I would be intrigued to know how Roche proposes to manage the data output of ... ... 5 terabytes of processed (i.e. basecalled) data over the course of a 12-hour run. It's certainly not something that would permit raw signal data to be output.