Tuesday, February 21, 2023

Is Illumina Delivering the MVP of Long Reads?

At AGBT last week Illumina released additional details on their still-incubating Complete Long Reads (CLR) product (formerly known as Infinity), but is still holding back both some interesting technical information and the exact performance specifications.  Illumina is already floating marketing messages, some of which depend on those still-in-flux specifications, and some of the claims may not withstand careful scrutiny.  And Illumina continues to make statements that irritate anyone with deep technical knowledge of the long read space.  The reaction by attendees was definitely mixed - one long read aficionado even offered me a very spicy title suggestion for this entry.  Alas, I can't use it, as it would be a bit of an inside joke based on a portion of a presentation that the presenter asked not be tweeted.  So instead you get the above title, which may not be what you think.

How you interpret my title probably depends on your experience with long read technologies and your desire (or lack thereof) to run a strictly Illumina shop.  To those used to PacBio or ONT data, Illumina CLR is a Minimum Viable Product, and you'll likely say that either with an acidic accent on "Minimum" or with air quotes around "Viable" (or perhaps both).  If you like the idea of sticking to Illumina and avoiding complicating your lab's workflow with multiple instruments (and the investment in multiple instruments), then perhaps you will view it as the Most Valuable Player in this space.  And I would say that proponents of the true long read platforms - and particularly the manufacturers of them - should pay attention to some aspects of CLR and how Illumina is positioning it; it would be foolish for the competitors to only sneer at this offering.

Illumina CLR is indeed based on the Longas MorphoSeq approach, as long speculated.  The workflow consists basically of a round of tagmentation to install PCR priming sites, some form of mutagenesis to install "landmark" mutations on the fragments, long range PCR to amplify the landmarked fragments, and then that DNA flows into a standard Illumina tagmentation library prep workflow.  Note that sample barcoding cannot occur until the second tagmentation.  The exact nature of landmark generation is one of the secrets Illumina is still holding back.  In the literature, both weak bisulfite treatment and mutagenic PCR have been used.  One interesting possibility is to include a thermolabile mutagenic enzyme, such as a cytosine deaminase, in the first tagmentation reaction; this would be heat-killed in the initial PCR.

The straightforward workflow will definitely be a selling point; the true long read competitors still have workflows that are complex and automation of them is not yet widespread.  

After sequencing, software onboard the DRAGEN FPGA component identifies reads with landmarks and locally assembles them to form either synthetic reads or linked reads.  Which brings us to the first complaint from the cognoscenti: would Illumina please stop claiming CLR isn't a synthetic read technology?  Until the NovaSeq flowcell itself kicks out multi-kilobase reads, any multi-kilobase reads coming off a NovaSeq are synthetic reads.  Stop trying to convince anyone otherwise - and if you've convinced yourself, Illumina, start unconvincing yourself stat!
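To make the landmark concept concrete, here is a toy sketch of the grouping step.  This is emphatically not Illumina's actual algorithm - all parameters and data structures are invented for illustration, and it makes the simplifying assumption that no two fragments share a landmark (in reality a landmark would be distinguished by genomic position plus the specific induced mutation).  The idea is just that short reads inheriting the same landmark mutations from a parent fragment can be clustered together before local assembly:

```python
import random
from collections import defaultdict

random.seed(42)

N_FRAGS, FRAG_LEN, READ_LEN = 3, 6000, 150
LANDMARKS_PER_FRAG, READS_PER_FRAG = 60, 200

# Give each simulated fragment a distinct set of landmark positions.
all_positions = random.sample(range(FRAG_LEN), N_FRAGS * LANDMARKS_PER_FRAG)
frag_landmarks = [set(all_positions[i::N_FRAGS]) for i in range(N_FRAGS)]

# Simulate short reads; each records the landmarks it happens to overlap.
reads = []  # list of (true_fragment_id, frozenset_of_landmarks)
for fid, landmarks in enumerate(frag_landmarks):
    for _ in range(READS_PER_FRAG):
        start = random.randrange(FRAG_LEN - READ_LEN)
        sig = frozenset(p for p in landmarks if start <= p < start + READ_LEN)
        if sig:  # only landmark-bearing reads are informative
            reads.append((fid, sig))

# Union-find: merge any two reads that share a landmark.
parent = list(range(len(reads)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

first_read_with = {}  # landmark position -> index of first read carrying it
for i, (_, sig) in enumerate(reads):
    for p in sig:
        if p in first_read_with:
            parent[find(i)] = find(first_read_with[p])
        else:
            first_read_with[p] = i

# Each resulting cluster contains reads from exactly one parent fragment.
clusters = defaultdict(set)
for i, (fid, _) in enumerate(reads):
    clusters[find(i)].add(fid)
```

Note that a single fragment can shatter into multiple clusters wherever the gap between adjacent landmarks exceeds the read length - one intuition for why the real product needs heavy oversampling.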

That also brings us to another flaw with Illumina's messaging around CLR: they haven't carefully reviewed their materials.  As pointed out by Mick Watson on Twitter, the diagram they present shows a single landmark  - and then a bunch of reads being magically grouped without any landmark guidance.  That's going to be brutally confusing to any potential customer trying to check their understanding of the technology against the Illumina marketing materials.

While I'm picking on Illumina marketing, who was the genius who decided that the initialism for this product should be CLR, one already in use by competitor PacBio?  Well, not much in use anymore -- Illumina has named their product to collide with an obsolete PacBio offering.  It will certainly lead to user confusion, as users search the literature and find papers on CLR that refer to the PacBio offering.  Heck, there will be future papers talking about PacBio CLR, given the long incubation of some academic publications.

And one more potshot: why did CEO Francis deSouza say this fall "these aren't strobe reads"?  Strobe reads were another PacBio offering, discontinued around a decade ago.  Who markets by saying "our product is definitely not the obscure, obsolete product that nobody used"?

Back to Illumina CLR.  Illumina is making some interesting claims around the length of synthetic reads generated.  While most of the reads are in the 5-6 kb range, there are some going out to around 30 kb, which is pretty impressive for long range PCR.

Illumina is also emphasizing how little DNA - as in 10 nanograms - can go into the process and claiming this as a competitive advantage.  But this is also one of the parameters they haven't locked down yet, so their "X less than competitors" claims must be taken with a block of salt.  And it's questionable already, though one needs to look carefully at the available library prep methods for PacBio and Oxford Nanopore to be certain -- and very carefully.  For example, PacBio has an ultra-low input protocol that can take around 10 nanograms -- but it isn't recommended for genomes as complex as human.  ONT's rapid barcoding kit requires only 50 nanograms of input according to the Nanopore Store entry.  There's also the wonderfully titled fresh preprint "Dancing the Nanopore Limbo", which explores generating useful data from the Zymo mock community from as little as 1 nanogram of input DNA.

Illumina is also trying to claim they don't require any special DNA preps.  Which is true for the competitors as well, though for PacBio the yield per run (and therefore cost per gigabase and runs required to complete a genome) are dependent on fragment length.  But you get what you pay for - if the fragments are on the short side you won't get as much long read information.

Illumina did show some interesting data on deliberately abusing samples or spiking them with contaminants and then successfully generating CLR data.  This all fits into their narrative of "buy CLR and you don't need to change much of what you are doing".  Of course, some samples are just beyond hope -- if you want long-range information from FFPE samples your only hope is proximity ligation, as the DNA itself is sheared to tiny pieces once you reverse the FFPE process.  PacBio and Nanopore are sensitive to DNA quality and DNA repair mixes are often part of the protocol -- but it isn't inherently obvious why nicks in DNA wouldn't be an issue for that first long-range PCR (I suppose a nick in only one strand might be tolerated).  But it would be smart for the competitors to produce similar data on DNA deliberately abused to create nicks, abasic sites and other lesions.

A huge differentiator between Illumina CLR and the true long read platforms is that Illumina is first targeting only human genome sequencing.  It isn't clear whether this is primarily a requirement of the DRAGEN software or whether aspects of the workflow (perhaps the landmarking?) must be tuned for sample complexity or GC content - more things Illumina hasn't disclosed publicly.  Sure, a lot of people are focused on human DNA samples, but there are lots of other interesting uses for true long reads, whether it is transcriptomes (with ONT having the only direct RNA offering), de novo genomes, or much more.

Illumina also hasn't yet released final specifications on how much sequencing capacity will be required to generate a human genome.  The CLR approach requires a degree of oversampling that may be in the 8-10 fold range, with a similar increase in sequencing costs and reduction in sequencing capacity if measured in genomes per sequencer per unit time.  The second product launch will be a targeted kit that focuses CLR on a limited set of difficult regions of the genome, thereby reducing the oversampling burden.  Details of this workflow were not announced.
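A back-of-envelope sketch shows how hard that oversampling penalty bites on throughput.  All the numbers here are illustrative assumptions (a 30x short-read baseline and a round 3,000 Gb flowcell yield), not Illumina specifications; only the 8-10 fold oversampling range comes from the discussion above:

```python
GENOME_GB = 3.1            # human genome size in gigabases
BASELINE_COVERAGE = 30     # typical short-read WGS depth (assumed)
FLOWCELL_OUTPUT_GB = 3000  # illustrative high-output flowcell yield (assumed)

def genomes_per_flowcell(oversampling: float) -> float:
    """Genomes sequenced per flowcell at a given raw-data oversampling factor."""
    raw_gb_per_genome = GENOME_GB * BASELINE_COVERAGE * oversampling
    return FLOWCELL_OUTPUT_GB / raw_gb_per_genome

standard = genomes_per_flowcell(1)    # ~32 genomes per flowcell
clr_best = genomes_per_flowcell(8)    # ~4 genomes per flowcell
clr_worst = genomes_per_flowcell(10)  # ~3.2 genomes per flowcell
print(f"standard: {standard:.1f}, CLR: {clr_worst:.1f}-{clr_best:.1f}")
```

Whatever the real flowcell yield turns out to be, the ratio is what matters: per-genome cost and run time scale linearly with the oversampling factor.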

As far as performance goes, Illumina seems to be focused on SNP detection statistics, correctly calling SNPs in difficult regions, and generating haplotype blocks.  I failed to take proper notes, so can't remember how much was discussed in terms of structural variants -- I thought they did bring up some examples but others who watched claimed they didn't.  But in any case, that will be a huge drawback of the targeted approach: losing the ability to detect structural variants in untargeted regions.  Certainly this won't threaten BioNano's push into replacing cytogenetics; I was told by an experienced person that none of the clinicians they work with trust inversion calls from short read data, due to being burned by abundant false positives.

Illumina has a curious history in the long read space.  They bought Moleculo back in 2013 and marketed it as a synthetic long read product, but not very well, and it was mostly forgotten.  I actually saw a paper using it just a day or so ago, which sort of stunned me.  In collaboration with academics, Illumina published multiple papers on Contiguity Preserving Transposition (CPT), finally developing a single tube version -- which looks a lot like Universal Sequencing Technology's TELL-Seq and BGI/MGI/Complete Genomics' stLFR approaches -- using bead-linked transposases with a bead-specific barcode added.  Illumina supports TELL-Seq analyses on BaseSpace.  But Illumina never marketed CPT.

As an aside, why hasn't TELL-Seq gotten more love?  It does require a very unusual read format and so cannot be mixed with other library types.  But it still seems like a useful strategy for Illumina sequencing of genomes and metagenomes, yet there are only two or so citations in PubMed (though there are a bunch of algorithm papers on BioRxiv - so a slice of the informatics community loves it).  An obvious explanation is that most people focused on this sort of data have decided that true long reads are the way to go; if so, that bodes ill for the commercial success of Illumina CLR.

Why didn't Illumina push these earlier approaches?  One notable difference between Illumina CLR and the CPT-Seq/TELL-Seq/stLFR approaches is that the latter require a more complex manufacturing process for individually barcoded beads.  The Moleculo process and early versions of CPT required many pipetting steps per sample, so it's clearer why those were never pushed hard by Illumina.

Back to Illumina CLR.  Illumina also presented glowing quotes from beta testing sites on the ease of use of the CLR workflow.  But someone I trust claimed to me that every one of the named CLR testers has purchased a PacBio Revio, so it isn't clear Illumina has really convinced these users that CLR beats HiFi.

So in summary, CLR is moving forward, but we still don't know many key details.  For labs that prefer to stay focused on human genome sequencing on Illumina, CLR offers a probably pricey option to augment their capabilities by providing a limited degree of long range information.  Whether such labs will choose to use CLR on samples from the start or use it only as a reflex method for retesting a limited subset of samples will be an interesting trend to watch.

For the competition, the messaging around input amounts, DNA sample quality, and particularly workflow simplicity and scalability should be taken very seriously.  The fact that Illumina CLR is a simple workflow that is very automation friendly and scalable to thousands of pooled samples (even if nobody will do that for human) should also be seen as a threat - there just aren't yet widespread automated protocols for the true long read platforms (some of my co-panelists on the recent GEN Live AGBT panel would definitely call the protocols "clunky") and their barcode spaces are often limited.  Exploring library generation robustness in the face of abused or contaminated samples should also be a priority for the true long read competitors.  There's also the important but admittedly difficult challenge of shifting the data yield and accuracy conversations away from purely SNP calling (very Illumina-friendly) to structural variant determination.  And I would expect the true long read vendors to hammer away at the limited applicability of the initial Illumina CLR products, when their platforms can run the gamut of biological species, both genomes and transcriptomes, as well as being applied to plasmid verification and various synthetic biology applications.

1 comment:

Shawn McClelland said...

I really enjoy your insightful comments on the state of the art of sequencing. Thanks!

Why do you think automation isn't advancing like the chemistries, workflows, and analyses (maybe it is and I just feel like it isn't)? I was in the lab before and understand, and enjoyed, the nuances of reproducibility - so it makes sense to come up with simple workflows. But at the expense of more powerful tools? Certainly it's easier to market a simple workflow, but why can't automation break that bottleneck? Step motor quality, robot flexibility/programmability, environmental effects (humidity, static,...). Is it just not lucrative or hard to sell a really good robot/automation? I know many places are selling hyper specific automation/chemistry/workflow/analyses packages - there's pros and cons there.

:) - Would love to hear your thoughts. Thanks!

- Shawn McClelland