I hate this image, because there is no obvious link between fragment blocks 2 and 3, nor 3 and 4. This particular fragment would not be reconstructed. https://t.co/TZwRIbz1tm
— Mick W@tson 🐄 (@BioMickWatson) February 8, 2023
While I'm picking on Illumina marketing: who was the genius who decided that the initialism for this product should be CLR, one already in use by competitor PacBio? Well, not much in use anymore -- Illumina has named their product to collide with an obsolete PacBio offering. It will certainly lead to user confusion as people search the literature and find papers on CLR that refer to the PacBio offering. Heck, there will be future papers talking about PacBio CLR, given the long incubation of some academic publications.
And one more potshot: why did CEO Francis de Souza say this fall "these aren't strobe reads"? Strobe reads are another PacBio offering, one discontinued around a decade ago. Who markets by saying "our product is definitely not the obscure, obsolete product that nobody used"?
Back to Illumina CLR. Illumina is making some interesting claims around the length of the synthetic reads generated. While most of the reads are in the 5-6 kb range, some go out to around 30 kb, which is pretty impressive for long-range PCR.
Illumina is also emphasizing how little DNA - as in 10 nanograms - can go into the process, and claiming this as a competitive advantage. But this is also one of the parameters they haven't locked down yet, so their "X less than competitors" must be taken with a block of salt. And it's questionable already, though one needs to look very carefully at the available library prep methods for PacBio and Oxford Nanopore to be certain. For example, PacBio has an ultra-low input workflow that can take around 10 nanograms -- but it isn't recommended for genomes as complex as human. ONT's rapid barcoding kit requires only 50 nanograms of input according to the Nanopore Store entry. There's also the wonderfully titled fresh preprint "Dancing the Nanopore Limbo", which explores generating useful data from the Zymo mock community from as little as 1 nanogram of input DNA.
Illumina is also trying to claim they don't require any special DNA preps, which is true for the competitors as well, though for PacBio the yield per run (and therefore cost per gigabase and the number of runs required to complete a genome) is dependent on fragment length. But you get what you pay for - if the fragments are on the short side, you won't get as much long read information.
Illumina did show some interesting data on deliberately abusing samples or spiking them with contaminants and then successfully generating CLR data. This all fits into their narrative of "buy CLR and you don't need to change much of what you are doing". Of course, some samples are just beyond hope -- if you want long-range information from FFPE samples, your only hope is proximity ligation, as the DNA itself is sheared to tiny pieces once you reverse the FFPE process. PacBio and Nanopore are sensitive to DNA quality, and DNA repair mixes are often part of the protocol -- but it isn't inherently obvious why nicks in DNA wouldn't be an issue for that first long-range PCR (I suppose a nick in only one strand might be tolerated). But it would be smart for the competitors to produce similar data on DNA deliberately abused to create nicks, abasic sites and other lesions.
A huge differentiator between Illumina CLR and the true long read platforms is that Illumina is initially targeting only human genome sequencing. It isn't clear whether this is primarily a requirement of the DRAGEN software or whether aspects of the workflow (perhaps the landmarking?) must be tuned for sample complexity or GC content - more things Illumina hasn't disclosed publicly. Sure, a lot of people are focused on human DNA samples, but there are lots of other interesting uses for true long reads, whether transcriptomes (with ONT having the only direct RNA offering), de novo genomes, or much more.
Illumina also hasn't yet released final specifications on how much sequencing capacity will be required to generate a human genome. The CLR approach requires a degree of oversampling that may be in the 8-10 fold range, with a similar increase in sequencing costs and reduction in sequencing capacity if measured in genomes per sequencer per unit time. The second product launch will be a targeted kit that focuses CLR on a limited set of difficult regions of the genome, thereby reducing the oversampling burden. Details of this workflow were not announced.
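To make that capacity hit concrete, here's a back-of-envelope sketch in Python. The flowcell yield and target depth are illustrative assumptions on my part, not Illumina specifications, and the 9-fold factor is just the midpoint of the range above.

```python
# Back-of-envelope impact of CLR oversampling on genomes per run.
# All constants are illustrative assumptions, not vendor specifications.

GENOME_GB = 3.1       # haploid human genome size, gigabases
TARGET_DEPTH = 30     # assumed depth for a standard short-read genome
OVERSAMPLING = 9      # midpoint of the 8-10 fold range discussed above
RUN_YIELD_GB = 3000   # assumed yield of one high-output flowcell

standard_gb = GENOME_GB * TARGET_DEPTH   # ~93 Gb per genome
clr_gb = standard_gb * OVERSAMPLING      # ~840 Gb per genome

print(f"Standard: {standard_gb:.0f} Gb/genome, "
      f"{RUN_YIELD_GB / standard_gb:.1f} genomes per run")
print(f"CLR:      {clr_gb:.0f} Gb/genome, "
      f"{RUN_YIELD_GB / clr_gb:.1f} genomes per run")
```

Under those assumed numbers, a flowcell that yields roughly 32 standard genomes would yield fewer than 4 CLR genomes -- which is why the final oversampling specification matters so much.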
As far as performance goes, Illumina seems to be focused on SNP detection statistics: correctly calling SNPs in difficult regions and generating haplotype blocks. I failed to take proper notes, so I can't remember how much was discussed in terms of structural variants -- I thought they did bring up some examples, but others who watched claimed they didn't. In any case, that will be a huge drawback of the targeted approach: losing the ability to detect structural variants in untargeted regions. Certainly this won't threaten BioNano's push into replacing cytogenetics; I was told by an experienced person that none of the clinicians they work with trust inversion calls from short read data, having been burned by abundant false positives.
Illumina has a curious history in the long read space. They bought Moleculo back in 2013, marketed it as a synthetic long read product -- but not very well -- and it was mostly forgotten. I actually saw a paper using it just a day or so ago, which sort of stunned me. In collaboration with academics, Illumina published multiple papers on Contiguity Preserving Transposition (CPT), finally developing a single-tube version -- which looks a lot like Universal Sequencing Technology's TELL-Seq and BGI/MGI/Complete Genomics' stLFR approaches -- using bead-linked transposases with a bead-specific barcode added. Illumina supports TELL-Seq analyses on BaseSpace. But Illumina never marketed CPT.
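For readers who haven't worked with linked reads, here's a minimal sketch of the shared concept behind CPT-Seq, TELL-Seq and stLFR: short reads carrying the same bead barcode are grouped into "read clouds" that mostly derive from a single long source molecule. The records below are invented for illustration; real pipelines work from barcode tags in aligned BAMs.

```python
# Toy illustration of linked-read "read clouds": group aligned short
# reads by their bead barcode, then infer the span of the original
# long molecule. All records here are made up for illustration.
from collections import defaultdict

# (bead_barcode, chromosome, alignment_position) for each aligned read
reads = [
    ("AACGTT", "chr1", 1_000_150),
    ("AACGTT", "chr1", 1_004_720),
    ("AACGTT", "chr1", 1_012_300),
    ("GGTCAA", "chr1", 2_500_040),
    ("GGTCAA", "chr1", 2_503_900),
]

clouds = defaultdict(list)
for barcode, chrom, pos in reads:
    clouds[(barcode, chrom)].append(pos)

for (barcode, chrom), positions in sorted(clouds.items()):
    span_kb = (max(positions) - min(positions)) / 1000
    print(f"{barcode} on {chrom}: {len(positions)} reads, "
          f"~{span_kb:.1f} kb inferred molecule span")
```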
As an aside, why hasn't TELL-Seq gotten more love? It does require a very unusual read format and so cannot be mixed with other library types, but it still seems like a useful strategy for Illumina sequencing of genomes and metagenomes. Yet there are only two or so citations in PubMed (though there are a bunch of algorithm papers on BioRxiv - so a slice of the informatics community loves it). An obvious explanation is that most people focused on this sort of data have decided that true long reads are the way to go; if so, that bodes ill for the commercial success of Illumina CLR.
Why not? One notable difference between Illumina CLR and the CPT-Seq/TELL-Seq/stLFR approaches is that the latter require a more complex manufacturing process for the individually barcoded beads. The Moleculo process and early versions of CPT required many pipetting steps per sample, so it's clearer why those were never pushed hard by Illumina.
Back to Illumina CLR. Illumina also presented glowing quotes from beta testing sites on the ease of use of the CLR workflow. But someone I trust claimed to me that every one of the named CLR testers has purchased a PacBio Revio, so it isn't clear Illumina has really convinced these users that CLR beats HiFi.
So in summary, CLR is moving forward, but we still don't know many key details. For labs that prefer to stay focused on human genome sequencing on Illumina, CLR offers a probably pricey option to augment their capabilities by providing a limited degree of long-range information. Whether such labs will choose to use CLR on samples from the start or only as a reflex method for retesting a limited subset of samples will be an interesting trend to watch.
For the competition, the messaging around input amounts, DNA sample quality and particularly workflow simplicity and scalability should be taken very seriously. The fact that Illumina CLR is a simple workflow that is very automation-friendly and scalable to thousands of pooled samples (even if nobody will do that for human) should also be seen as a threat - there just aren't yet widespread automated protocols for the true long read platforms (some of my co-panelists on the recent GEN Live AGBT panel would definitely call those protocols "clunky"), and their barcode spaces are often limited. Exploring library generation robustness in the face of abused or contaminated samples should also be a priority for the true long read competitors. There's also the important but admittedly difficult challenge of shifting the data yield and accuracy conversations away from purely SNP calling (very Illumina-friendly) to structural variant determination. And I would expect the true long read vendors to hammer away at the limited applicability of the initial Illumina CLR products when their own platforms can run the gamut of biological species, both genomes and transcriptomes, as well as being applied to plasmid verification and various synthetic biology applications.
I really enjoy your insightful comments on the state of the art of sequencing. Thanks!
Why do you think automation isn't advancing like the chemistries, workflows, and analyses (maybe it is and I just feel like it isn't)? I was in the lab before and understand, and enjoyed, the nuances of reproducibility - so it makes sense to come up with simple workflows. But at the expense of more powerful tools? Certainly it's easier to market a simple workflow, but why can't automation break that bottleneck? Stepper motor quality, robot flexibility/programmability, environmental effects (humidity, static, ...)? Is it just not lucrative, or hard to sell a really good robot/automation? I know many places are selling hyper-specific automation/chemistry/workflow/analysis packages - there are pros and cons there.
:) - Would love to hear your thoughts. Thanks!
- Shawn McClelland