In the prior piece, I covered the technical details unveiled by Roche for their SBX technology, but generally tried to avoid predicting its effects on the marketplace. Here I put on the pundit's hat. The TL;DR: this is a major new sequencing platform, and if you're at one of the competitors you have about a year before it fully hits the market - though in reality the action has already started, as Roche begins grabbing hearts and minds. What can we anticipate about the effect on each of the current players? As noted in the prior piece, some key aspects - in particular purchase price and run cost - aren't being disclosed by Roche, which complicates prognostication.
Thursday, February 20, 2025
Roche Xpounds on New Sequencing Technology
Bar bets can be a powerful force in human society. One of the best known books on the planet, The Guinness Book of World Records, originated from the need to equitably settle wagers. Many entries in that tome are questions of immense scale - the largest this or the heaviest that. Shortly before this posted, Roche unveiled a sequencing technology that, per its inventors, may be the result of such a bar bet: how large a dangling bit can you stick on a nucleotide and still have it incorporated by a polymerase?
Monday, January 27, 2025
Olink Reveal: Focused Proteomics, Simplified
I’ve covered a lot of genomics in this space, but there is an inherent challenge to studying biology via DNA - DNA is the underlying blueprint, but that blueprint must pass through multiple steps before actual biology of interest emerges. RNA-Seq gets closer, but much of the real action is at the level of proteins (though much is not - let’s not forget all the metabolites!). When I set out in this space 18 years ago, I thought I’d cover more proteomics but that didn’t materialize - time to plunk one piece on the proteomics side of the ledger!
Proteomics has multiple challenges, but two inherent ones are the diversity of proteoforms and the dynamic range within the proteome.
The diversity of proteins within a human is astounding, even if we set aside the inherently hypervariable antibodies and T cell receptors, which have specific means of diversification within an individual, including random generation of sequence during VDJ recombination and somatic hypermutation of antibodies. The rest of the bunch are subject to transcript-level diversification by features such as alternative promoters, alternative splicing and RNA editing, and then a further wealth of post-translational proteolysis, phosphorylation, glycosylation and a heap more covalent modifications. If we really wanted to make things complex, we'd worry about protein localization, who a protein is partnered with and even alternative protein conformations - but let's just stick to primary proteoforms and a diversity estimated in excess of 1 million different forms.
The key point is that no analytical method is capable of resolving all of these. Any proteomics method ignores much of the proteome entirely and, for many other proteins, compresses many forms into a single signal. Indeed, most proteomic tools look at very short windows of sequence or perhaps patches of three-dimensional structure, and will rarely if ever be able to directly connect two such short windows or patches - they will be stuck correlating them. The takeaway: all proteomics methods work on a reduced representation of the proteome.
The dynamic range in the proteome is also astounding, with some potentially challenging effects. For example, blood serum is utterly dominated by a handful of proteins such as serum albumin, beta 2 microglobulin and immunoglobulins - for methods that look at the total proteome, there is a serious danger of flooding out your signal with these abundant but relatively dull proteins and not being able to see interesting ones, such as hormones, that are many logs lower in concentration.
Proteomics has been dominated by mass spectrometry, which has had over three decades to develop into a mature science. Mass spec is inherently a counting process and on its own can't focus the signal or filter out the dull stuff. On top of that, you don't fly intact proteins in a mass spec but peptides, and there are only a few useful proteases out there. Peptides don't ionize consistently, which adds a layer of challenge to quantitation. But as noted, this has been an intensely developed field for multiple decades, and so there are very good mass spectrometry proteomics techniques using liquid chromatography (LC-MS) and other methods to remove abundant dull proteins and fractionate complex peptide pools into manageable ones.
But, protein LC-MS is very much its own discipline, and most proteomics labs aren’t strong in genomics or vice versa - though there are certainly collaborations or dual-threat labs. LC-MS setups require serious capital budgets for the instruments and their accompanying sample handling automation and highly skilled personnel.
A number of companies are attempting to apply the strategies of high throughput DNA sequencing to peptide sequencing or identification. Quantum-Si is the only one to make it to market, but other startups such as Erisyon are plugging away. These methods look a bit like mass spectrometry in their sample requirements, as they will also be counting peptides - and the current Quantum-Si instrument doesn't count nearly enough of them to be practical for complex samples such as serum or plasma.
The other "next gen proteomics" approach - one lesson not learned from the DNA sequencing world is the problem with calling anything "next-gen"; this year marks the 20th anniversary of the commercial launch of 454 sequencing - is to use affinity reagents such as antibodies or aptamers, tag them with DNA barcodes, and then sequence those barcodes on high throughput DNA sequencers. By using affinity reagents, the problem of boring-but-abundant proteins goes away: just don't include affinity reagents for them. Dynamic range can be addressed as well - the exact details aren't necessarily disclosed by manufacturers, but one could imagine labeling only a fraction of a given antibody to tune how many counts are generated from a given concentration of the targeted analyte.
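To make that dynamic-range trick concrete, here's a toy sketch - my own illustrative numbers and model, not anything a vendor has disclosed - of how tagging only a fraction of an abundant target's antibody could compress its share of the read budget:

```python
# Toy model of partial antibody labeling; all numbers are illustrative, not vendor specs.

def expected_reads(analyte_level, labeled_fraction):
    """Barcode reads scale with how much bound antibody actually carries a DNA tag."""
    return analyte_level * labeled_fraction

# An abundant, dull protein ~4 logs above a rare hormone of interest:
abundant = expected_reads(analyte_level=1e6, labeled_fraction=0.001)  # heavily "de-tuned"
rare     = expected_reads(analyte_level=1e2, labeled_fraction=1.0)    # fully labeled

print(abundant, rare)  # 1000.0 100.0 - a 10,000-fold concentration gap becomes
                       # only a 10-fold difference in sequencing reads consumed
```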
Olink Proteomics, now a component of Thermo Fisher, is one company offering a product in this space. Olink's Proximity Extension Assay (PEA) relies on two antibodies to each protein of interest and requires hybridization between the DNA probes on both antibodies to enable extension by a polymerase and generate a signal. This increases the specificity of the signal and tamps down any signal from non-specific binding - or from antibodies simply sitting free in solution.
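A crude way to see why the two-antibody requirement helps - my own back-of-the-envelope framing, not Olink's math - is that independent off-target binding probabilities multiply:

```python
# Toy comparison of background signal; probabilities are illustrative only.

p_offtarget = 0.01      # assumed chance a single antibody binds the wrong protein

# Single-antibody readout: any off-target binding event can generate signal.
background_single = p_offtarget

# PEA-style readout: both antibodies must land on the same wrong molecule AND
# their DNA probes must hybridize before the polymerase can extend.
proximity_factor = 0.1  # assumed extra penalty for the two probes having to meet
background_pea = p_offtarget ** 2 * proximity_factor

print(background_single)  # 0.01
print(background_pea)     # 1e-05 - orders of magnitude less background
```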
Olink has released a series of panels targeting increasing numbers of proteins in the human proteome. This is generally a good thing - except that counting more proteins means generating more DNA tags, which means a bigger sequencing budget per sample. The other knock on Olink's approach (and that of their competitor SomaLogic, now within Standard BioTools and also marketed by Illumina) is a complex laboratory workflow that mandates liquid handling automation. So the big Olink Explore discovery panels are inevitably going to be run at huge genome centers that have both the big iron sequencers and the required liquid handling robots. And this strategy has started paying scientific dividends - some of which were covered by the Olink Proteomics World online symposium last fall, which featured speakers such as Kari Stefansson. Olink's and Ultima's recent announcement on starting to process all of the UK Biobank is an example of such grand plans, and this will be run at Regeneron's genome center.
Academic center core labs and smaller biotechs often power important biomedical advances, but if Olink Explore is only practical with NovaSeq/UG100-class machines and fancy liquid handlers, then few of these important scientific constituencies will be able to access the technology. That would be unfortunate, since small labs often cultivate very interesting sample sets that very large population-based projects like UK Biobank might not have. Large population-based studies and carefully curated small projects are complementary - so should only one of them be able to access Olink's technology?
And that's where Olink's newest product, Olink Reveal, comes in, enabling smaller labs to process 86 samples at a time. First, a select set of about 1000 proteins is targeted, bringing the required sequencing for a panel of samples plus controls down to a NextSeq-class flowcell - only about 1 billion reads required. Second, the laboratory workflow has been made very simple and practical to execute with only multichannel pipettes. The product ships with a 96-well plate containing dried-down PEA reagents; simply adding samples and controls to the wells activates the assay for an overnight incubation. The next day, PCR reagents are added to graft sample index barcodes onto the extension products, and everything is pooled to form a sequencing library. Library prep costs $98 per sample at list price - $8,428 per kit. Throw in sequencing costs of $2K-$5K per run (depending on the instrument) and this isn't out of line with other genomics applications.
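Some quick arithmetic on those numbers - my own back-of-the-envelope, with sequencing prices obviously varying by site and instrument:

```python
# Back-of-the-envelope math on the figures quoted above.
samples_per_kit    = 86      # 86 samples; the remaining wells hold controls
wells_per_plate    = 96
proteins_on_panel  = 1000    # "about 1000" targeted proteins
reads_per_flowcell = 1e9     # NextSeq-class run

reads_per_well_per_protein = reads_per_flowcell / (wells_per_plate * proteins_on_panel)
print(f"{reads_per_well_per_protein:,.0f} reads per protein per well")    # roughly 10,000

library_cost = 98                  # list price per sample
for run_cost in (2_000, 5_000):    # quoted sequencing range per run
    per_sample = library_cost + run_cost / samples_per_kit
    print(f"${per_sample:,.0f} per sample with a ${run_cost:,} run")       # ~$121 to ~$156
```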
Of course, this is a reduced representation relative to the larger Explore sets, but Olink has selected the proteins to be a useful reduced representation. They've used sources such as Reactome to prioritize proteins, and have also prioritized proteins shown to have genetically driven expression variability in the human population - protein QTLs, aka pQTLs. If the new panel is cross-referenced to studies using the larger panels, most of those studies would still have found at least one protein showing a statistically significant change in concentration. This can be seen in the plot below, where each row is a study colored by disease area: on the left is the distribution of P-values for the actual Olink Explore data, and on the right the same data filtered to proteins in the Olink Reveal panel.
It’s also robust - Olink has sent validation samples to multiple operators and compared the results, and the values from each lab are tightly correlated.
So Olink, with their affinity proteomics approach, is basically following the same playbook as genomics did with exomes. When hybrid capture approaches for exome sequencing first came out, it was thought they would be used for only a few years and then be completely displaced by whole genome sequencing (WGS). But exomes have proven too cost effective - even with drops in WGS costs, it is still possible to sequence more samples with exomes for the same budget. Yes, the risk of missing causal variants outside the exome target set was always a concern - the recent excitement around lesions in non-coding RNAs such as RNU4-2 has demonstrated that - but many investigators saw exomes as enabling studies that otherwise wouldn't happen. Plus, sometimes the bigger worry is biological noise obscuring a signal you could see, and that is dealt with by more samples.
The new Olink Reveal product fills a gap between Olink's large Explore discovery sets and very small custom panels. In the Proteomics World talks, many speakers described work run with PEA panels of only two dozen or so targets, often using PCR as a readout rather than sequencing. This shows one bit of synergy in the Olink acquisition by Thermo Fisher, as Thermo has an extensive PCR product catalog, including array-type formats. Thus PEA follows the well-worn pattern in genomics: huge discovery panels for some studies, high-value panels that balance cost and coverage for many studies, and focused custom panels for validating findings in very large cohorts. The Proteomics World talks even suggested some of these focused panels might soon be seriously evaluated as in vitro diagnostics. With developments like these, targeted proteomics via sequencing will be a very interesting space to watch.
Friday, September 27, 2024
QuantumScale: Two Million Cells is the Opening Offer
I'm always excited by sequencing technology going bigger. Every time the technology can generate significantly more data, experiments that previously could only be run as proof-of-concept can move to routine, and what was previously completely impractical enters the realm of proof-of-concept. These shifts have steadily enabled scientists to look farther and broader into biology - though the complexity of the living world always dwarfs our approaches. So it was easy to say yes several weeks ago to an overture from Scale Bio to again chat with CEO Giovanna Prout about their newest leap forward: QuantumScale, which will start out enabling single cell 3' RNA sequencing experiments with two million cells of output - but that's just the beginning. And to help with it, they're collaborating with three other organizations sharing the vision of sequencing at unprecedented scale: Ultima Genomics on the data generation side, NVIDIA for data analysis, and the Chan Zuckerberg Initiative (CZI), which will subsidize the program and make the research publicly available on Chan Zuckerberg Cell by Gene Discover.
Scale Bio is launching QuantumScale as an Early Access offering, originally aiming for 100 million cells across all participants - though since I spoke with Prout they've received proposals totaling over 140 million cells. The first 50 million cells would be converted to libraries at Scale Bio and sequenced by Ultima (with CZI covering the cost), with the second 50 million cells prepped in the participants' labs, Scale Bio covering the library costs and CZI subsidizing the sequencing. Data return would include CRAMs and gene count matrices. Labs running their own sequencing have a choice of Ultima or NovaSeq X - the libraries are agnostic, but it isn't practical to run them on anything smaller. Prout mentioned that a typical target is 20K reads per cell, though Scale Bio and NVIDIA are exploring ways to reduce this; with 2M cells that's 40B reads required - or about two 25B flowcells on a NovaSeq X.
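The read-budget arithmetic, written out with the same numbers as above:

```python
import math

# Read budget for a 2M-cell experiment at the quoted 20K reads/cell target.
cells = 2_000_000
reads_per_cell = 20_000
total_reads = cells * reads_per_cell
print(f"{total_reads:.1e} reads")                        # 4.0e+10, i.e. 40 billion

reads_per_25b_flowcell = 25e9
print(math.ceil(total_reads / reads_per_25b_flowcell))   # 2 NovaSeq X 25B flowcells
```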
How do they do it? The typical Scale Bio workflow has gained a new last step, for which two million cells is expected to be only the beginning. The ScalePlex reagent can first be used to tag samples prior to the initial fixation, with up to 1000 samples per pool (as I covered in June). Samples are fixed and then distributed to a 96-well plate in which reverse transcription and a round of barcoding take place. Those wells are then pooled and split into a new 96-well plate that performs the "Quantum Barcoding", with around 800K barcodes within each well. Prout says full technical details of that process aren't being released now but will be soon, though she hinted it might involve microwells within each well. Indexing primers during the PCR add another level of coding, generating over 600 million possible barcode combinations. This gives Scale Bio, according to Prout, a roadmap to experiments with 10 million, 30 million or perhaps even more cells per experiment - and multiplet rates "like nothing".
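To get a feel for those numbers, here's a rough sketch of the combinatorics. The count of PCR index combinations is my assumption (it hasn't been disclosed), chosen so the product lands near the quoted figure, and barcode collisions are only one contributor to multiplets:

```python
# Rough combinatorics on the barcoding scheme described above.
rt_wells           = 96
quantum_barcodes   = 800_000    # per well, per the announcement
index_combinations = 8          # assumed - not disclosed by Scale Bio

combos = rt_wells * quantum_barcodes * index_combinations
print(f"{combos:,} possible barcode combinations")        # 614,400,000

# Birthday-problem style estimate: chance that any other cell in a 2M-cell run
# lands on the same barcode combination (assuming uniform random assignment).
cells = 2_000_000
collision_rate = 1 - (1 - 1 / combos) ** (cells - 1)
print(f"{collision_rate:.2%} of cells would share a combination")   # ~0.3%
```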
As noted above, the scale of data generation is enormous, and that might stress or break some existing pipelines. Prout suggested that Seurat probably won't work, but scanpy "might". So having NVIDIA on board makes great sense - they're already on the Ultima UG100 performing alignment, but part of the program will be NVIDIA working with participants to build out secondary and tertiary analyses using the Parabricks framework.
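For anyone wondering what "might work" could look like in practice, here's a minimal, generic scanpy pattern - not a QuantumScale-specific pipeline, and the filename is hypothetical - for poking at a count matrix too big to load outright:

```python
import scanpy as sc

# Backed mode keeps the count matrix on disk rather than loading it into RAM;
# this is standard scanpy/AnnData functionality, nothing QuantumScale-specific.
adata = sc.read_h5ad("quantumscale_counts.h5ad", backed="r")   # hypothetical filename
print(adata.shape)    # (n_cells, n_genes) without materializing X in memory

# Pull a manageable slice into memory to explore QC thresholds before
# committing to a full-scale (likely GPU-accelerated) analysis.
subset = adata[:100_000].to_memory()
sc.pp.calculate_qc_metrics(subset, inplace=True)
print(subset.obs.columns.tolist())   # per-cell QC columns added by scanpy
```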
What might someone do with all that? I don't run single cell 3' RNA experiments myself, but reaching back to my pharma days I can start imagining. In particular, there are a set of experiment schemes known as Perturb-Seq or CROP-Seq which use single cell RNA readouts from pools of CRISPR constructs - the single cell data both provides a fingerprint of cellular state and reveals which guide RNA (or guide RNAs; some of these have multiple per construct) are present.
Suppose there is a Perturb-Seq experiment and the statisticians say we require 10K cells per sample to properly sample the complexity of the CRISPR pool we are using. Two million cells just became 200 samples. Two hundred seems like a big number, but suppose we want to run each perturbation in quadruplicate to deal with noise. For example, I'd like to spread those four replicates around the geometry of a plate, knowing that there are often corner and edge effects, and even more complex location effects from where the plate sits in the incubator. So now only 50 perturbations - perhaps my 49 favorite drugs plus a vehicle control. Suddenly 2M cells isn't so enormous any more - and I didn't even get into timepoints, different cell lines, different compound concentrations or any of the numerous other experimental variables I might wish to explore. But Perturb-Seq on 49 drugs in quadruplicate, at a single concentration in a single cell line, is still many orders of magnitude more perturbation data than we could have dreamed of packing into three 96-well plates two decades ago at Millennium.
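The same cell-budget arithmetic, spelled out (the extra axes at the end are purely illustrative numbers of mine):

```python
# Cell budget arithmetic from the paragraph above.
total_cells      = 2_000_000
cells_per_sample = 10_000     # the statisticians' requirement for the CRISPR pool
replicates       = 4          # quadruplicate, spread across plate positions

samples       = total_cells // cells_per_sample   # 200 samples
perturbations = samples // replicates             # 50 perturbations (49 drugs + vehicle)
print(samples, perturbations)

# Every additional experimental axis divides the budget again, e.g. 3 timepoints
# across 2 cell lines (illustrative) leaves only ~8 perturbations.
print(perturbations // (3 * 2))
```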
And that, as I started with, is the continuing story: 'omics gets bigger, and our dreams of what we might explore ratchet up to whatever is newly just within reach.
The announcement of QuantumScale also has interesting timing in the industry, arriving a bit over a month after Illumina announced it was entering the single cell RNA-Seq library prep market with the purchase of Fluent BioSciences. While nobody (except perhaps BGI/MGI/Complete Genomics) ties their single cell solution exclusively to one sequencing platform, the connection of Scale Bio and Ultima makes clear business sense - Illumina is now a frenemy to be treated more cautiously, and boosting an alternative is good business. Ultima would of course love it if QuantumScale nudges more labs into their orbit, and these 3' counting assays perform very well on Ultima, with few concerns about homopolymers confusing the results (and Prout assures me that all the Scale Bio multiplex tags are read very effectively). And as is so often the case, NVIDIA finds itself at the center of a new data-hungry computing trend.
Will many labs jump into QuantumScale? Greater reach is wonderful, but one must have the budget to run the experiments and grind the data. PacBio in particular, and to a degree Illumina, have seen their big new machines face limited demand - or, in the case of Revio, the real possibility that customers are spending the same money to get more data (great for science, not so great for PacBio's bottom line). But perhaps academic labs won't be the main drivers here; instead it may be pharma, and perhaps even more so the emerging space of tech companies hungry for biological data to train foundation models - sometimes not even having their own labs, but relying on companies such as my employer to run the experiments.
A favorite quote of mine is from late 1800s architect Daniel Burnham; among his masterpieces is Washington DC's Union Station. "Make no little plans. They have no magic to stir men's blood and probably will not themselves be realized." I can't wait to see what magic is stirred in women's and men's blood by QuantumScale, which is certainly not the stuff of little plans.
[2024-10-02 tweaked wording around how the program is funded]