Monday, October 03, 2022

Illumina Roadmap Part 2: Infinity Becomes Illumina Complete Long Reads

The Only Thing Clear About Infinity Is It Is Now Complete Long Reads. 
Illumina told us a new name for Infinity -- Illumina Complete Long Reads -- and an initial pair of products, but didn't reveal anything new about the underlying tech.  They threw out a number of claims, but very vague ones.  Particularly confusing is that it "isn't synthetic reads".  If not, then what is it?  

A Pause for Spinning Definitions

Vendors in this space will try to play with definitions and it isn't clear there is consensus.  I'm going to try to carefully sketch out the definitions I work with, in the hopes that everyone will fall in around me of course.

Single Molecule Sequencing

All of the post-Sanger techniques start with a single molecule, but many of them acquire their actual signal from an ensemble of clonal copies of that original molecule.  True single molecule sequencing acquires the signal from operations directly on a molecule, with no replication and no ensemble.  

By this definition, three single molecule systems have been commercialized: Pacific Biosciences, Oxford Nanopore and Helicos (now SeqLL).

Linked Reads

Linked reads are a set of reads which are derived from an input molecule but which are insufficient to reconstruct the molecule.   Some linked read technologies give ordering between the linked reads; most just say they came from the same molecule.  Probably the best known linked read technology is the one 10X Genomics once offered but has discontinued, but there is also Long Fragment Read (in various incarnations) from Complete Genomics, TELL-Seq commercialized by Universal Sequencing Technology and the various flavors of Contiguity Preserving Transposition published by Illumina and academic collaborators but never commercialized.  There's also lots of published but not commercialized stuff.  One of the more interesting of these was a scheme in which long molecules were bound to a non-patterned flowcell, tagmentation was performed on the surface and then sequencing happened -- adjacent clusters could be inferred to be from the same molecule and a degree of ordering information could be obtained by tracing the path of each molecule.  The downside is that this isn't a very information-rich system on the flowcell; to work the long molecules must be relatively sparsely applied to the surface.

PacBio's strobe reads are an example of a linked read technology that is also single molecule. With strobe reads, the illumination and imaging were flipped from off to on and back off again; each "strobe" of active imaging gave sequence somewhat disconnected from the prior strobe.  The distance in bases between strobes could be estimated -- but if SMRTbell libraries were used, one monkey wrench was that you couldn't know if the polymerase had "gone 'round the horn" and was now reading from the other strand.  Strobe reads were also terrible for multiplexing, since it could be easy to miss the barcode during all the strobing -- and even the first strobe wasn't guaranteed to pick up a barcode.  Once PacBio could reliably get read lengths of multiple kilobases, the headaches of strobe made no sense and they dropped it many years ago -- which is why it is very strange that Illumina kept saying Illumina Complete Long Reads aren't strobe reads.

Also odd is that Illumina would pick something with the same abbreviation as a PacBio product: we now have PacBio's "Continuous" vs. Illumina's "Complete" as the C in CLR.  Since there are many publications on PacBio's CLR, it seems a poor choice by Illumina to give their product a name that when googled will naturally direct interested parties towards their rival.

Synthetic Reads

Which brings us to synthetic reads, which Illumina claims Complete Long Reads are not.  In my book synthetic reads are the case in which one uses some sort of oversampling of short reads to reconstruct individual input molecules.  LoopSeq, now owned by Element Biosciences, is an example though alas the molecular details haven't been disclosed.  Illumina's failed productization of the Moleculo technology as TruSeq Synthetic Long Reads is another.  Singular has published on a synthetic read approach, though it appears to be restricted to synthetic biology constructs due to very specific requirements for an internal sequence island.  
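
To make the linked-versus-synthetic distinction concrete, here is a toy Python sketch of the definitions as I've laid them out. It is not any vendor's pipeline: the function name, the read tuples and the assumption that we already know where each short read sits on its source molecule are all invented for illustration (a real workflow would have to infer that by assembly or alignment). A molecule whose short reads tile it end to end is a candidate for a synthetic read; one with gaps only gives linked reads.

```python
def classify_by_coverage(reads, molecule_lengths):
    """Label each molecule by whether its short reads tile it end to end.

    reads: iterable of (molecule_id, start, end), half-open coordinates on
           the source molecule (known here only because this is a toy).
    molecule_lengths: dict of molecule_id -> length in bases (assumed > 0).
    """
    # Per-base depth for every molecule, starting at zero.
    depth = {mol: [0] * length for mol, length in molecule_lengths.items()}
    for mol, start, end in reads:
        cov = depth[mol]
        # Clip each read to the molecule, then count it at every base it spans.
        for pos in range(max(start, 0), min(end, len(cov))):
            cov[pos] += 1
    # Any uncovered base means the molecule cannot be fully reconstructed.
    return {
        mol: "synthetic" if min(cov) > 0 else "linked"
        for mol, cov in depth.items()
    }


# Hypothetical data: molecule A is fully tiled, molecule B has a gap.
toy_reads = [
    ("A", 0, 150), ("A", 100, 250), ("A", 200, 300),
    ("B", 0, 150), ("B", 400, 550),
]
labels = classify_by_coverage(toy_reads, {"A": 300, "B": 600})
print(labels)  # {'A': 'synthetic', 'B': 'linked'}
```

A depth of at least one at every base is the most generous threshold possible; real reconstruction would presumably need a good deal more than that, which is where the oversampling comes in.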

Back to Illumina Complete Long Reads Questions

Now, back to reading the tea leaves hoping for some CLaRity - but perhaps I should be wishing for CLaiRvoyance instead?  Okay, I'll stop the annoying capitalization forthwith!

Two workflows were announced.  The first is a straight-up human WGS assay available in Q1.  It wasn't clear whether this really cares about what species the input is or not -- but likely if you feed it anything else much of the information processing is on your own.  Or perhaps, as one person suggested to me, the reaction requires tuning for a particular G+C content or other parameters and is not universal.   Illumina did not talk about precision or sensitivity of SV calling with this assay.  Nor did they go into detail as to how well it covers the human genome -- they talk about capturing the "5%" not reachable with short reads being the value proposition.  What fraction of the T2T reference can be reliably mapped with Illumina CLR?  Especially not discussed was cost or efficiency -- with Illumina CLR how many reads will a human WGS require and what multiple of a standard WGS is this?  And can CLR and standard libraries be mixed on the same flowcell without penalty?  For example, TELL-Seq libraries are a poor mix with standard libraries because only one index is used for sample identity and the other index must be read around 20 cycles to capture the bead-specific barcode.
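
Since Illumina gave no numbers, the best I can do on the "what multiple" question is sketch the arithmetic. In the Python below, every input is a placeholder I made up: the function names, the roughly 3.1 gigabase genome size, the 30X standard depth, and above all the long read depth and oversampling factor, none of which Illumina has disclosed. The point is only that if each reconstructed long read needs its molecule covered many times over by short reads, the short-read burden scales with that factor.

```python
import math


def read_pairs_needed(genome_bp, depth, read_len=150, oversampling=1.0):
    """Paired-end clusters needed to reach `depth` fold coverage.

    oversampling: hypothetical short-read coverage spent per base of
    reconstructed long read (1.0 for an ordinary short-read WGS).
    """
    bases_per_pair = 2 * read_len
    return math.ceil(genome_bp * depth * oversampling / bases_per_pair)


def multiple_of_standard(long_read_depth, oversampling, standard_depth=30):
    """Short-read sequencing required, as a multiple of a standard WGS."""
    return long_read_depth * oversampling / standard_depth


HUMAN_GENOME_BP = 3_100_000_000  # rough approximation, not a precise figure

standard = read_pairs_needed(HUMAN_GENOME_BP, depth=30)
# Placeholder scenario: 10X of long reads, each needing 20X oversampling.
clr_guess = read_pairs_needed(HUMAN_GENOME_BP, depth=10, oversampling=20.0)
print(standard, clr_guess, multiple_of_standard(10, 20))
```

Swap in your own guesses; whatever the real oversampling turns out to be, it is the number that decides whether this assay is a modest premium over standard WGS or several flowcells' worth.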

The second workflow was described as arriving late next year (so perhaps with Auld Lang Syne playing at the launch party) and will be a targeted human assay intended to deliver only the 5%; this would complement their standard WGS workflows such as PCR Free.  The fact that Illumina thinks this will sell suggests that the read oversampling of Illumina CLR is substantial.  It also means Illumina thinks labs aren't putting a high premium on comprehensive haplotyping and structural variant calling.

Not addressed at all this time, and I think unaddressed at the AGBT presentation, is performance on really tough sequences -- and such data would provide info on the inner workings.  For example, what is the longest homopolymer which can be reliably resolved?  The longest simple nucleotide repeat, such as the CAG array in the huntingtin gene?

Patents might not even deliver revelation -- much of the Longas error-prone PCR approach, which Illumina CLR is suspected to rely on, was anticipated by publications which predate Longas forming.  It is possible that Illumina is going to protect the innards of CLR as trade secrets rather than reveal anything with weak patent filings (weak if the core technologies are unpatentable due to prior art).

One possible hint of interest is the claim that CLR sequences are mostly in the 6-7 kilobase range but reads over 30 kilobases are seen.  Long-range PCR at 30 kilobases isn't unheard of but is very difficult to run reliably.  Either the approach doesn't actually use amplification or the process uses something like rolling circle amplification or multiple displacement amplification which can handle very long fragments.  Or it could be that long PCR is getting better -- a vendor just emailed me (spam, not personalized) for a mastermix that is claimed to support amplifications of 20kb or more.

Back to teases.  Illumina kept repeating that the workflow is not complicated, hinting that other workflows are.  If you look at LoopSeq for example, it's a multi-day process with multiple bead cleanups.  But claiming your workflow isn't complicated isn't the same as revealing that workflow for independent eyes to assess.

My Current Thoughts

Illumina CLR really is looking like a modest product aimed primarily at keeping existing human WGS customers in the fold.  No clear support for growth markets such as protein engineering, synthetic biology or agriculture, where high quality long reads have clear value.  The guarded information on cost suggests that the human WGS assay will not have a clear price advantage -- and perhaps a price disadvantage -- versus single molecule long read offerings from PacBio and Oxford Nanopore.  And no clear option of getting methylation information, which both PacBio and ONT can extract from the same experiment; Illumina may be pegging methylation as simply not critical for WGS studies.

But it's dangerous to underestimate Illumina's technical team and even more so their marketing department, so we'll see how much Illumina CLR can hold back the growing tide of true single molecule long reads.

[2022-10-04 12:48 -- fixed Molecular to Moleculo -- stupid autocorrect!]


Anonymous said...

was it " "5%" not reachable with short reads" or "5% of genic regions"

Anonymous said...

How can a 5-6kb read on a 2x150 sequencer not be synthetic? And why do they calculate an N50 if they are not building contigs?

gasstationwithoutpumps said...

Where does your classification scheme classify things like Chicago libraries?

Anonymous said...

Thanks Keith, you’re the best!