In the beginning, all sequencing was short read: Wu and Kaiser are generally credited (though sadly not by the Nobel committee) with the first DNA sequencing paper, working out 20 bases of the cohesive ends of lambda phage.
Later came Maxam-Gilbert and Sanger sequencing, the first “high throughput” methods. Reads for these technologies were initially in the few hundreds of basepairs, though Sanger approaches later went a bit longer. During the Human Genome Project, a little company called Li-Cor had a sequencer that used labels fluorescing in the infrared - less background there - and coaxed out reads a bit over a kilobase.
Later came 454 and other “next generation sequencing” approaches, which also started short. One early project for 454 was on ancient human DNA, which I found clever as the technology was then limited to about 100 basepairs and many archaeological DNA fragments weren’t much bigger - the theoretical advantage of fluorescent Sanger was thereby blunted. 454 would later top out close to a kilobase, though only a minority of reads in a run would get to such lengths.
Other NGS methods also started short; the Solexa team published a paper in NAR showing how much you could do with 25 basepair reads! Somehow the teams at Illumina kept pushing their technology to 150 basepairs along with facile paired end technology, and so the industry benchmark evolved to “2x150” paired end sequencing. SOLiD never got beyond 2x75 IIRC - and the first paired end approach on SOLiD had a systematic basecalling issue on the reverse read, which I discovered but never got around to publishing or publicizing - the tech was pretty dead before I got up steam. Ion Torrent also showed up, initially with single reads and later some sort of paired end scheme, though that arrived after interest in the platform had largely faded.
On MiSeq, Illumina pushed things further, first to 2x250 and then 2x300. 2x250 would show up again on the HiSeq 2500.
Around that time, true long read sequencing started showing up - first PacBio and then ONT in 2014. Reads were reliably a few kilobases at first, then tens of kilobases; nowadays over a hundred kilobases is semi-common for ONT, and those megabase “whales” have been generated with some frequency.
And so the market has divided into short read and long read sequencing. Short read sequencers are all (save Helicos, which didn’t get far) clonal systems, and so limited in read length by dephasing and polony issues; single molecule long read sequencers have no polonies and can’t dephase, and so can go very long.
Both ONT and PacBio could in principle be used as short read sequencers. On PacBio it is terribly inefficient, since data yield is closely tied to library insert size. ONT was a credible possibility here, though early on short fragments had basecalling issues, and I always blamed (unfairly?) short fragments for killing flowcell productivity. But realistically, most short read applications place huge value on gross read numbers - ONT just couldn’t compete there.
Is there a middle ground market between short reads and long reads? I will claim here that such a “midi read” market does exist. It will probably never command dedicated sequencing technologies tuned to it, but it does represent at least a borderland between short and long read schemes - particularly now that Roche SBX sequencing pushes read lengths far beyond typical short read platforms but not into much of the long read space.
What are examples of midi read applications? Perhaps the biggest one would be whole transcriptome sequencing, particularly single cell sequencing. If you want to do this now, you’re going to use either ONT or PacBio Iso-Seq. Since PacBio data yields are so dependent on insert length, and most mRNAs (sorry titin!) are much shorter than what HiFi can handle, PacBio offers Kinnex kits to generate concatemer libraries - each initial HiFi read carries multiple cDNA payloads which can be parsed out by the SMRTLink software.
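As a toy sketch of that parsing idea - emphatically not SMRTLink’s actual algorithm, and with an invented linker sequence - splitting a concatemer read at a known linker might look like this:

```python
# Toy illustration of concatemer splitting (NOT SMRTLink's algorithm).
# The linker below is a made-up placeholder, not the real Kinnex sequence,
# and real tools tolerate sequencing errors rather than requiring exact matches.
LINKER = "ATGCGTACGTTAGC"  # hypothetical linker sequence

def split_concatemer(read: str, linker: str = LINKER, min_len: int = 200):
    """Return the candidate cDNA segments between linker occurrences."""
    segments = read.split(linker)
    return [s for s in segments if len(s) >= min_len]
```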
But Kinnex is a bunch of extra laboratory steps. It’s very clever, but still extra steps. Hence the attraction of instead using ONT cDNA sequencing (you could go the direct RNA route, but the throughput is much lower).
Now SBX is yet another option, with reads up to 1.2 kilobases. Looking at PacBio’s Kinnex literature, in human mRNA datasets the mean mRNA lengths are around 1.7 kilobases. That’s a bit longer than what SBX does well on, but it’s an average - there are going to be many inserts that are 1.2 kb or shorter. There is also the question of “how important is getting a truly full length cDNA insert?” In some cases, critical - if one must read all the way to the far end to capture critical barcoding information, that’s a problem. In other cases, maybe it is acceptable - a 1200 basepair tag on an mRNA might catch a lot of splicing. The advantage of SBX is almost frightful levels of data production - instead of less than 10M reads per Revio flowcell, or perhaps several multiples of that on PromethION, SBX looks capable of delivering 60M reads in the 1200 class per hour - and far more of the slightly shorter ones.
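As a back-of-envelope sketch of how much of a transcriptome might fit under 1.2 kb: if one assumes - purely for illustration - that insert lengths are lognormally distributed with a mean of 1.7 kilobases (the shape parameter below is a guess, not a measured value), the fraction of inserts SBX could read end to end comes out around 40%:

```python
# Sketch: fraction of a cDNA library fitting within ~1.2 kb, assuming a
# lognormal length distribution. The mean comes from the Kinnex literature;
# the shape parameter sigma is purely an assumption.
import numpy as np
from scipy import stats

mean_len = 1700                          # mean mRNA length in bp
sigma = 0.6                              # assumed lognormal shape parameter
mu = np.log(mean_len) - sigma**2 / 2     # chosen so the mean equals mean_len

frac = stats.lognorm.cdf(1200, s=sigma, scale=np.exp(mu))
print(f"Estimated fraction of inserts <= 1.2 kb: {frac:.0%}")   # ~39%
```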
Another possible target for midi read sequencing is protein engineering. A dream outcome for any protein engineer is to devise some sort of assay where a large pool of candidates goes into the assay and a smaller pool comes out. This could involve actually coupling cellular survival to your engineered gene, surface display, ribosome display, FACS, etc. And if you’re doing this, you’d like to be able to sequence the entire protein of interest at huge scale. Since many interesting proteins (sorry Cas9!) are under 400 amino acids long - and 400 amino acids times 3 bases per codon is exactly 1200 bases of coding sequence - these could fit in the SBX format and generate perhaps hundreds of millions of reads per hour of runtime.
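The arithmetic is worth making explicit; here’s a trivial budget check, where the overhead allowance for primers, barcodes and UMIs is a hypothetical number, not any kit’s spec:

```python
# Sketch: does a protein's coding sequence plus library overhead fit in an
# SBX-length read? overhead_bp is a hypothetical allowance, not a kit spec.
def fits_sbx(protein_aa: int, overhead_bp: int = 120, max_read_bp: int = 1200) -> bool:
    cds_bp = protein_aa * 3   # three bases per codon
    return cds_bp + overhead_bp <= max_read_bp

for aa in (250, 360, 400):
    print(f"{aa} aa -> fits: {fits_sbx(aa)}")   # 400 aa leaves no room for overhead
```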
Other applications? In synthetic biology we often have combinatorial assemblies of parts which must then be read to know which parts are connected to which and what barcode they are tied to - with the barcode, one can use a short amplicon to infer the entire assembly in a downstream assay. So SBX could deliver Saganesque “billions and billions” of reads for the short assay - provided you first get the long association sequence. And again, it isn’t uncommon for these barcode association constructs to be under 1.2 kilobases.
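A minimal sketch of that association step, assuming invented flanking sequences around a 12 base barcode and abstracting the part identification into a user-supplied function:

```python
# Sketch of barcode-to-assembly association from midi reads. The barcode
# flanks are invented placeholders; part_caller stands in for whatever
# alignment-based logic identifies the parts present in a read.
import re
from collections import defaultdict

BARCODE_RE = re.compile("GATTACA([ACGT]{12})TACCAGT")  # hypothetical flanks

def associate(reads, part_caller):
    """Map each barcode to the set of part combinations seen with it."""
    table = defaultdict(set)
    for read in reads:
        m = BARCODE_RE.search(read)
        if m:
            table[m.group(1)].add(part_caller(read))  # part_caller returns a tuple of part IDs
    return table
```

Barcodes mapping to more than one part combination would flag chimaeras or barcode collisions.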
Ribotyping - rRNA sequencing - is another possible midi read market. You can ribotype with amplicons that work in short read sequencers, particularly in extended formats like 2x300, but the longer the amplicon, the better the taxonomic resolution. There are kits on the market that slurp up entire ribosomal RNA operons for sequencing on true long read platforms. But there are more midi-like options - PacBio has a 16S Kinnex kit which generates 1.5 kilobase amplicons, just outside SBX’s range. But again, if you got all but the last 300 bases, would that be useful - especially if you get manyfold more reads for significantly less wet lab work?
Other sorts of amplicon sequencing could benefit as well. For example, during the SARS-CoV-2 pandemic a large number of amplicon sequencing approaches were developed. With Oxford Nanopore, amplicons in the 400-500 basepair range were very common, and there were some designs around 1200 bases IIRC. For Illumina platforms, there were three general classes of approach. Some used very short amplicons that fit well into 2x150 formats, with the downside that accuracy might depend on position in the amplicon. Plus that meant more amplicons, which are harder to design, more expensive to order primers for, and more subject to amplicon interactions that suppress some amplicons, among a host of other issues. Some made longer amplicons and tagmented them, which solved the read length problem but could lead to coverage biases depending on position in the amplicon - Tn5 doesn’t like to insert close to the end of a fragment. That can be dealt with by having i5-bearing and i7-bearing primers in your primer pool - but the dose of these must be tuned or you’ll never see reads in the middle of the amplicon. Finally, some labs went with 2x300 and longer amplicons - though again, often there were regions covered only on one strand. Being able to generate billions of 500 basepair reads could find quick application in pathogen surveillance amplicon assays. And Roche, during the last years of 454 sequencing, talked about developing very long amplicon assays for HLA haplotyping.
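The panel complexity argument is easy to quantify with a rough tiling calculation (the 100 base overlap below is an assumed value, not any particular scheme’s):

```python
# Sketch: primer pairs needed to tile a SARS-CoV-2-sized genome with
# overlapping amplicons. The overlap value is an assumption.
import math

GENOME_BP = 29903    # SARS-CoV-2 reference genome length
OVERLAP_BP = 100     # assumed overlap between adjacent amplicons

for amplicon_bp in (400, 1200, 2500):
    step = amplicon_bp - OVERLAP_BP               # new bases covered per amplicon
    n = math.ceil((GENOME_BP - OVERLAP_BP) / step)
    print(f"{amplicon_bp} bp amplicons: ~{n} primer pairs")
```

The roughly 100 pairs at 400 bp versus fewer than 30 at 1200 bp matches the scale of the tiling schemes actually deployed.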
One market I’ve seen brought up is de novo assembly, an application dear to my heart but realistically not a large one. I’d love to try feeding 1000 bp reads at gigantic scale into a metagenomic assembler, but in truth this is a space where true long reads will still rule. Even in bacterial genomes, difficult repeats such as ribosomal RNAs and transposons - not to mention my favorites, the polyketide synthases - are much larger than midi reads. Perhaps in metagenomics the vastly greater read numbers win some sensitivity for detecting rare species, but midi length reads won’t save things.
It should be noted that another approach to these applications has been synthetic reads, with LoopSeq (acquired by Element Biosciences) being perhaps the only surviving commercial synthetic read kit (Moleculo, we hardly knew ya!). LoopSeq can go up to about 5 kilobases, but shorter is better - and many of its uses are ribotyping and protein engineering. LoopSeq is a multiday protocol; my crew did not see great options for automating it. LoopSeq also carries the potential for generating chimaeras during its PCR steps.
Element’s AVITI has some other interesting ways to slip into midi read space. Some of my colleagues have seen success with custom recipes in which read 1 goes for 400 cycles. But the longer read modes on AVITI generate lower numbers of reads, can be sensitive to wide variation in insert (and therefore polony) size, and don’t seem compatible with long inserts.
On the long insert issue, AVITI can use inserts up to about 1200 basepairs due to its polony amplification scheme - polonies from long inserts are quite large, and that means fewer polonies allowed. The combined data quality degradation from long inserts and long read lengths apparently just means not getting good data, and mixing long inserts and short inserts isn’t recommended. Still, for some applications in synthetic biology, having this option in your pocket is useful - sometimes the variable sequences you care to associate are separated by long, constant spacers. Element has also published data in conjunction with the DeepVariant team showing how long inserts can assist in genotyping near repetitive regions of human or other genomes.
But SBX has inherent advantages here, though AVITI will likely have better data quality in at least part of the sequence. SBX quality will be uniform over the entire read, whereas AVITI will start high but eventually plunge below usefulness due to dephasing. SBX also doesn’t care if you give it a library of wildly heterogeneous size; there’s no way for short inserts to interfere with long inserts, other than perhaps a kinetic advantage in loading and therefore an introduced sampling bias. On AVITI, a mix of polony sizes (polony size is related to insert size) - at least at this time - degrades image processing. For example, the per-cycle signal from short inserts will be brighter than from long inserts, since the polony will have more copies of the short insert. It could be worse: my team advised us never to put amplicons bigger than about 500 basepairs onto patterned array Illumina flowcells, as they saw a significant dropoff in performance.
And the numbers for SBX here are what Roche has now; there’s always the possibility of further optimization of the processivity (or perhaps reassociating polymerases with incomplete strands?) in the expandomer copying step to push the numbers of midi reads much higher. Roche also discussed driving sensor numbers higher, to multiply the yield of all size classes in a run.
To be honest, we have only limited data on SBX performance in the midi range. One of the presented slides showed the distribution of read lengths from a single cell experiment run for an hour: 3.9B total reads, 2.5B over 400 basepairs, 1.7B over 600, 0.7B over 800, 0.2B over 1000 and 0.06B over 1200 - but what was the underlying distribution of the cDNA library? Perhaps this is good enough, but I’d be interested in seeing the distribution if you feed SBX a library where you know every insert is 2 kilobases in length.
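Taking the slide’s numbers at face value, as a survival curve they translate to:

```python
# The slide's cumulative read counts (billions at or above each length),
# converted to fractions of the 3.9B total.
reads_over = {0: 3.9, 400: 2.5, 600: 1.7, 800: 0.7, 1000: 0.2, 1200: 0.06}

total = reads_over[0]
for length_bp, billions in reads_over.items():
    print(f">= {length_bp:>4} bp: {billions:>5}B reads ({billions / total:.0%})")
```

So only about 1.5% of reads cleared 1.2 kilobases in that run - though without knowing the input length distribution, it’s impossible to say how much of that is the sequencer versus the library.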
Have I sold you that there is a distinctive class of midi read applications? Or do you still see the world as bifurcated between short reads and long reads? The comments, as always, are open.