What Illumina Said
Chemistry X promises to give twice the read lengths with half the cycle time and three times the accuracy. It is entirely new chemistry at every stage: new blocking group, new dyes, new polymerases, new deblocking chemistry. The chemistry is promised to be more resistant to heat, with a "50X reduction in hydrolysis", "3X faster block cleave" and "greatly reduced prephasing / phasing errors", and the new polymerase has been engineered for high-fidelity, fast incorporation of the new "X-Nucleotides".
Backwards Compatible or New Instruments?
The first question occupying many minds is whether Chemistry X will run on existing boxes or require new instruments. Financial types are salivating at the thought of a new upgrade cycle and the associated revenue bump for Illumina; sequencing scientists are hoping no new instrument is required so they aren't the ones shelling out to provide that bump.
I'm leaning towards Chemistry X not requiring a new box, or perhaps just a small swap-out. Perhaps the new dyes require tweaks to the optical system, but it would be surprising if much changed on the fluidics side -- plus, many of the newer Illumina instruments have most (all?) of the fluidics in the disposable cartridge. But conversely, maybe something like different temperature control is needed.
How Much Faster in Reality?
One thing to remember about faster cycling: cycling isn't all of the instrument time. Clustering is a sizable component of each run, and the "paired end turn" takes time, as does imaging. The plot below shows the times for the different supported modes on the NovaSeq for each flowcell type -- clustering would appear to be around 9-10 hours of the total time, as that would be the Y-intercept of the three lines. Caveat: I've asked before about what causes the bends in these lines and was told to take all these numbers with a grain of salt. The bottom of the S4 curve is the only non-paired-end mode yet doesn't deviate from the trend, so either the paired end turn takes no time or these numbers really do include reading dual barcodes (which requires the turn), and that would improve with faster chemistry.
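The Y-intercept reasoning above can be sketched as a small least-squares fit. The run times below are illustrative placeholders, not Illumina's published specs; the point is only that the intercept of run time vs. cycle count estimates the fixed per-run overhead that faster cycling chemistry would not reduce.

```python
# Hypothetical run times (hours) vs. total sequencing cycles for one
# flowcell type -- illustrative numbers, NOT Illumina's published specs.
cycles = [100, 200, 300]        # total cycles (both reads plus indices)
hours = [17.8, 26.2, 34.5]      # wall-clock run time for each mode

# Ordinary least-squares fit: hours = slope * cycles + intercept.
n = len(cycles)
mean_x = sum(cycles) / n
mean_y = sum(hours) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, hours)) \
        / sum((x - mean_x) ** 2 for x in cycles)
intercept = mean_y - slope * mean_x

# The intercept approximates the fixed per-run overhead (clustering,
# paired-end turn, etc.) that halving cycle time would leave untouched.
print(f"~{slope * 60:.1f} min per cycle, ~{intercept:.1f} h fixed overhead")
```

With these made-up numbers the intercept lands near the 9-10 hour clustering estimate read off the plot.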
How Useful Are 2X Read Lengths?
Okay, I'm a bit wishy-washy here. Longer reads are definitely useful for certain applications, and increasing the read length can increase the amount of data if you are willing to wait for it. But the patterned flowcells on the newer instruments appear to dislike library inserts longer than about 500 base pairs, so doubling read lengths from, say, 150 to 300 isn't really getting you 600 bases of data but rather 500, with a lot of overlapping sequence that is somewhat redundant. Now, if they keep older bridge amplification devices like the MiSeq, which tolerate long library inserts better -- or update them -- and apply this chemistry to them (is 2x600 possible?), that could be interesting for some high-value applications like MHC amplicon sequencing.
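The insert-length arithmetic above is worth making explicit. A minimal sketch, using the ~500 bp insert cap from the text:

```python
# Back-of-envelope: how many DISTINCT template bases do longer paired
# reads buy if the patterned flowcell caps inserts at ~500 bp?
def distinct_bases(insert_len, read_len):
    """Distinct template bases covered by a 2 x read_len paired-end run."""
    return min(insert_len, 2 * read_len)

for read_len in (150, 300):
    covered = distinct_bases(500, read_len)
    overlap = 2 * read_len - covered
    print(f"2x{read_len} on a 500 bp insert: {covered} distinct bases, "
          f"{overlap} bp of redundant overlap")
```

So doubling from 2x150 to 2x300 on a 500 bp insert yields 500 distinct bases rather than 600, with 100 bp of mate overlap (which does at least boost accuracy in the overlapped region).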
A Direct Shot at Singular?
Illumina has presumably been developing Chemistry X for a long time, but teasing it now and the aspects being teased would seem triggered by Singular Genomics' instrument specs, which were claiming higher speed and higher accuracy.
An aside: I'm going to attempt to differentiate the terms "linked read" and "synthetic read" in this piece. Linked reads enable determining that multiple reads came from the same molecule, but do not go beyond that -- this definition would encompass not only the discontinued 10X linked read technology, BGI's LFR and Universal Sequencing Technology's TELL-Seq, but also Hi-C methods. Synthetic reads actually try to reconstruct the input molecules -- LoopSeq from Loop Genomics falls into this category.
What Illumina Said
Infinity is promised to offer 10X throughput versus "legacy long read technology" (that use of "legacy" seems like throwing shade) with read lengths up to 10 kilobases and input requirements 10% of existing technologies.
Is It Longas MorphoSeq?
Given the lack of any technical details, speculation soon mounted on that front. Illumina previously (and briefly) commercialized the Moleculo synthetic read technology and has published several times on Continuity Preserving Transposition (CPT) linked read technology but never commercialized it. A number of very clever people noted that Australian company Longas, which was developing a synthetic read technology called MorphoSeq, seems to have vanished from the internet yet key former Longas employees are now employees of Illumina. Did Illumina stealthily acquire Longas?
If it is MorphoSeq, it is a clever approach that solves some theoretical issues with other synthetic read approaches. MorphoSeq first involves tagmenting input DNA to generate fragments of the desired size that have barcodes on each end. These are then mutagenized by limited-cycle PCR with a nucleotide analog to create mutations scattered throughout the molecules that will serve as landmarks. So now sequence that was monotonously repetitive isn't so repetitive any more. Bottleneck these to create a desired level of diversity and then convert them (probably by another round of tagmentation) into sequencing libraries for readout on a short read sequencer.
By finding reads with the same landmark mutations, one can stitch together the original molecules -- if the bottlenecking is done correctly and the pool is oversampled sufficiently.
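The stitching step can be caricatured in a few lines. This is my own toy reconstruction of the idea, not Longas code -- real reads would carry only partial, overlapping landmark sets, and the signatures below are hypothetical:

```python
# Toy sketch: short reads carrying the same landmark-mutation signature
# are binned together; each bin is taken to derive from one input molecule.
from collections import defaultdict

# Each read is (landmark_signature, fragment); the signature is a
# hypothetical frozenset of (position, base) landmark mutations observed.
reads = [
    (frozenset({(120, "A"), (480, "T")}), "frag1a"),
    (frozenset({(120, "A"), (480, "T")}), "frag1b"),
    (frozenset({(75, "G")}), "frag2a"),
]

bins = defaultdict(list)
for signature, frag in reads:
    bins[signature].append(frag)

# Given sufficient oversampling, each bin's fragments would then be
# assembled into a synthetic read approximating the original molecule.
for signature, frags in sorted(bins.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(signature), "->", frags)
```

The real informatics problem -- chaining partially overlapping landmark sets and tolerating sequencing errors in the landmarks themselves -- is of course much harder than this binning cartoon.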
Looking at the slide deck presented in June 2019 by Longas founder (and now Illumina employee) Aaron Darling to the Long Read Club (with presentation & interesting Q&A captured on YouTube), some nice data is shown for assembling microbes of various G+C content -- though it does engage my pet peeve that the G+C contents sampled are 33%, 43%, 51%, 65% and 66%. Warp Drive had me constantly living in the 73% G+C land of the Streptomyces, and one of the strain factory founders has a love of Mesoplasmas, which are around 25% G+C. Please, if you are testing any DNA-based technology, such as sequencing or amplification, and want to make claims around G+C percentages, cover the whole range! If you want suggestions, I'm happy to give exact species -- and I even know where to get some of them for free!
Illumina seems to be pitching Infinity as able to access the "10%" of the genome that isn't accessible to short reads. One interesting possibility around the MorphoSeq approach is that it may be capable of resolving simple repeats longer than the final library fragment length. With continuing interest in trinucleotide and other VNTR disorders, that could matter: as long as the repeat arrays are marked with mutations at good intervals, the Longas approach should be able to resolve them.
But what is the final accuracy? In the Long Read Club presentation Longas showed some limited, high-level data on assembling bacteria, but not a lot of detail on where the approach breaks down.
A catch with any synthetic read approach is that one must oversample. The Long Read Club presentation talks about requiring just 15X coverage with long reads along with typical coverage of short reads, but it would appear that these synthetic reads in turn require at least 7X oversampling -- which seems a bit low, given a crude rule-of-thumb I have for the Lander-Waterman statistics, but then again full reconstruction of molecules might not be required if some clever informatics can take advantage of synthetic scaffolds. So a human genome sequenced with such a short-plus-synthetic strategy might require around 135X of actual coverage -- plan that in your budget! That is likely to narrow or erase any cost advantage over long read platforms.
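The 135X figure falls out of simple multiplication. A sketch of the budget math, assuming 30X as the "typical coverage of short reads" (my assumption; the presentation doesn't pin that number down):

```python
# Coverage budget for a short-plus-synthetic human genome strategy,
# using the figures from the Long Read Club presentation.
synthetic_long_cov = 15   # target coverage in synthetic long reads
oversampling = 7          # short-read oversampling per synthetic-read base
typical_short_cov = 30    # ASSUMED standard short-read coverage

total_raw_cov = synthetic_long_cov * oversampling + typical_short_cov
print(f"~{total_raw_cov}X raw short-read coverage required")
```

15 x 7 = 105X of raw reads just to build the synthetic reads, plus the conventional 30X, gets you to roughly 135X of sequencing to pay for.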
The PCR in the Longas approach would also erase any methylation marks, which Illumina doesn't really have a channel to address anyway. Presumably one could use bisulfite or one of NEB's clever enzymatic conversions to preprocess the fragments prior to amplification -- perhaps best upstream of the first tagmentation. The Long Read Club presentation didn't address any implications of such an approach -- e.g. would the mutagenesis require further tuning to get an ideal level of marking within converted DNA?
I don't work enough with human (or other mammalian) genomes to have a quick grasp of what the analytical gains are from longer and longer haplotype information; I'm sure Oxford Nanopore will be promoting the idea that 10 kilobase units miss a lot of information, particularly when looking at methylation marks.
In any case, we don't yet have any benchmark data for the Longas / MorphoSeq approach to see what the accuracy is for calling various sizes of structural variants, or whether there is any degradation in SNP calling due to incomplete removal of the introduced mutations. Small indels have been real trouble for short reads -- does this approach greatly boost their detection?
Outside of human genomics, de novo assembly seems likely to get a measurable but incomplete boost. Metagenome assembly is probably another good use.
Long read platforms have been popular for uses such as 16S amplicons, in which every molecule in the library is very similar to every other. Whether the MorphoSeq approach will work well here remains to be seen. Transcriptome elucidation is another popular use (e.g. PacBio IsoSeq); this wasn't discussed in the Long Read Club presentation. Would the varying sizes of transcripts be an issue for tuning the marking rate? There would also be a loss of transcripts longer than 10kb -- a small but fascinating space. At Warp we were initially focused on rapamycin-like clusters, which feature one polyketide synthase ORF that is generally about 10 thousand amino acids long -- in an operon with another PKS on the order of 6 thousand amino acids. Of course, that is nothing compared to the titin in our muscles, with spliced transcripts over 100 kilobases in length.
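To make the PKS point concrete, the codon arithmetic alone puts those ORFs well past a 10 kb synthetic read -- a quick sketch:

```python
# Rough transcript-length arithmetic: 3 nucleotides per codon means a
# 10,000 amino acid polyketide synthase ORF is ~30 kb of coding sequence,
# before even counting UTRs or the rest of the operon.
def orf_coding_kb(n_amino_acids):
    """Coding-sequence length in kilobases (3 nt per codon, stop ignored)."""
    return n_amino_acids * 3 / 1000

for name, aa in [("rapamycin-like PKS", 10_000), ("partner PKS", 6_000)]:
    print(f"{name}: ~{orf_coding_kb(aa):.0f} kb of coding sequence")
```

Even the smaller 6,000 amino acid partner ORF is ~18 kb of coding sequence, so 10 kb synthetic reads can't span either transcript, let alone titin's >100 kb spliced messages.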
How Involved is the Workflow?
The Long Read Club presentation does claim a workflow of around 6 hours that is automation-friendly. It also notes that sample tagging occurs early, so multiple samples may be pooled to reduce the amount of handling further downstream. It will be interesting to see what product Illumina ends up using for the 10kb size selection step -- I will make the wild leap of logic to suggest it won't be a Circulomics product, since Circulomics is now owned by PacBio.
A 6 hour workflow, which apparently still leaves time to throw a prawn on the barbie, would compare very favorably to PacBio's standard workflow and is certainly much faster than PacBio's amplification-based workflow. It seems similar in time to ONT's ligation workflow -- but is an eternity compared to ONT's rapid workflows.
Did Illumina Fudge A Bit on Input Requirements Comparison?
Illumina is claiming Infinity will require only 10% as much input DNA as "legacy long read platforms", but since they never say how much input, we can't really compare. I suspect this is a less-than-fully-honest comparison, perhaps pitting the most input-hungry competitor library prep against some value for Longas. PacBio's Ultra-Low Input workflow, which involves PCR and two rounds of ligation, can start with as little as 5 nanograms (fewer than 1000 human genome equivalents). ONT's latest amplification-free methods in beta push the input requirements down to 50 nanograms.
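The genome-equivalents figure is easy to sanity check. A sketch assuming ~6.5 pg of DNA per diploid human genome (the haploid genome is ~3.2 Gbp, roughly 3.3 pg; these masses are my assumed round numbers):

```python
# Sanity check: how many human genome copies are in a 5 ng input?
input_ng = 5.0
pg_per_diploid_genome = 6.5   # ASSUMED: ~2 x 3.3 pg per haploid genome

genome_equivalents = input_ng * 1000 / pg_per_diploid_genome  # ng -> pg
print(f"~{genome_equivalents:.0f} diploid genome equivalents")
```

That works out to roughly 750-800 diploid copies, consistent with the "fewer than 1000 human genome equivalents" parenthetical (counting haploid copies would roughly double it).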
Of Course, That Could All Be Wrong
There is the possibility that Infinity isn't MorphoSeq, that it's just a coincidence that Longas disappeared and Aaron Darling is now at Illumina -- perhaps he's there but came up with some other clever approach. I suppose we'll have to wait and see.
[2022-01-13 -- aaargghhhh! fixed a trailed off sentence in the first paragraph under Utility]
[2022-01-14 -- per gas station's comment, fixed the "half the cycle speed" to "half the cycle time"]