Wednesday, May 01, 2024

First Illumina Complete Long Reads Preprint

Readers of this space might have detected a significant slant towards skepticism in my coverage of Illumina Complete Long Reads (iCLR), exacerbated by now-deposed Illumina CEO Francis deSouza claiming it isn't a synthetic read technology.  Illumina's posters on iCLR at AGBT this year seemed to reinforce my view that Illumina was marketing purely on short-read-like terms - call SNPs in a few more hard-to-map regions of the genome, but not really compete head-to-head with the true long read platforms.  But now there is a preprint out on medRxiv that reports iCLR results for a Genome In A Bottle (GIAB) sample as well as seven samples from individuals with potential genetic diseases of unresolved cause.  The GIAB sample was also sequenced with some of the latest Oxford Nanopore chemistry (Duplex R10.4.1) and as HiFi libraries on PacBio Revio - enabling comparisons of the platforms.  The preprint is probably going to be revised and expanded - I'm certainly hoping some of my comments are found constructive - but it is very useful to see.  And perhaps it will soften positions such as mine on iCLR's utility.

How iCLR Works

iCLR uses a clever strategy to generate synthetic reads with a simple workflow.  

DNA is first tagmented with a low concentration of transposase so that very long fragments are generated - but fragments which can still be amplified with long PCR.  Next, these are very lightly mutagenized, or as Illumina and the authors describe it, "landmarked".  Landmarked fragments are then amplified by long PCR, followed by a round of typical tagmentation to generate libraries which go through PCR to add sequencing indexes; the completed iCLR library is sequenced to some depth (more on this later).  A standard Illumina library is generated from the original sample and sequenced to 30X coverage. 

An informatics pipeline then clusters the iCLR reads based on landmarks.  The preprint describes the DRAGEN platform (Illumina's FPGA-accelerated system) using two approaches.  For reads which can be uniquely mapped, variant calling vs. the reference is used.  For reads that don't map uniquely, a kmer approach is used.  The clustered reads are then assembled into long reads, and the landmarks removed by comparison to the standard Illumina library.  
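DRAGEN's actual clustering algorithm is proprietary and not detailed in the preprint, but the general idea of grouping reads by shared rare kmers can be sketched in a few lines.  This is purely a toy illustration I've written (the function names, parameters and union-find approach are my own, not Illumina's):

```python
from collections import defaultdict

def _find(parent, x):
    # path-halving find for union-find
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_by_shared_kmers(reads, k=11, max_occ=5):
    """Group reads that share at least one rare k-mer (toy landmark clustering).

    Landmarks make molecules of common origin share unusual k-mers;
    k-mers seen in more than max_occ reads are treated as repetitive
    genomic sequence and ignored.
    """
    index = defaultdict(list)            # k-mer -> list of read ids
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            index[read[j:j + k]].append(i)
    parent = list(range(len(reads)))
    for ids in index.values():
        if len(ids) > max_occ:           # skip overly common k-mers
            continue
        for other in ids[1:]:            # union all reads sharing this k-mer
            ra, rb = _find(parent, ids[0]), _find(parent, other)
            if ra != rb:
                parent[rb] = ra
    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[_find(parent, i)].append(i)
    return list(clusters.values())
```

Real implementations must also handle sequencing errors, reverse complements and read pairs, but the sampling logic is the same: two short reads land in the same cluster only if they overlap a common landmark-bearing stretch.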

I will throw out my first serious criticism here.  A very similar landmarking methodology was developed in Michael Wigler's lab at Cold Spring Harbor Laboratory, with publications I could find in 2014 and 2018 (I could have sworn there are even earlier ones); neither is cited.

GIAB Metrics

GIAB sample HG002 was used for benchmarking.  This is certainly the most analyzed human genome ever, tackled by just about every sequencing platform and library preparation methodology as well as optical mapping.  If you're serious about a new method, HG002 must be your first target!  And all the HG002 data in this preprint is available freely; for appropriate reasons of individual privacy, the reads from the clinical samples are being made available only under a more restrictive framework.

Table 1 in the preprint describes a number of typical metrics for variant calling.  I'll throw out my standard pet peeve: when you're comparing 99.95% to 99.91% accuracy, it's time to switch from linear-scaled accuracies to log-scaled error rates (phred scores/Q-values).  Each category of variant (SNV, INDEL and SV) was compared against two different benchmarks.  I'm not familiar with the benchmarks; time to do some more reading.  But one big take home is that for F1 score on SNVs, iCLR beat HiFi on both benchmarks while splitting with ONT.  Another is that iCLR won on F1 for indels against both benchmarks, though dropping to 91.80% on one of them.  iCLR performed comparatively poorly for structural variants, hitting F1 scores of only 79.76% and 74.94%; ONT and PacBio split honors here but were always over 90%.  
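To illustrate the pet peeve: the conversion from a fractional accuracy to a phred-scaled error rate is just a log transform, and it spreads out exactly the differences that linear accuracies compress.

```python
import math

def phred(accuracy):
    """Convert a fractional accuracy into a phred-scaled Q-value."""
    return -10 * math.log10(1 - accuracy)

# 99.95% vs 99.91% accuracy look nearly identical on a linear scale,
# but differ by about 2.5 phred units - nearly a twofold error rate gap:
print(round(phred(0.9995), 1))  # 33.0
print(round(phred(0.9991), 1))  # 30.5
```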

Another metric looked at 4600 challenging and medically important exons and whether 10X coverage of mapped reads was achieved.  All three technologies excel out to about the 4500th gene and then iCLR and HiFi mostly overlap as they drop; only a handful (about 20) lack 10X coverage with nanopore.  These differences are ascribed largely to differences in read length: the read N50s were 33.8 kb for ONT, 16.9 kb for HiFi and 6.54 kb for iCLR.  It's interesting that 6.54 vs. 16.9 doesn't give much advantage, but getting to 33.8 makes a difference for about 80 genes.  A fun little student project would be to take the ONT dataset and progressively trim it (random chunks from each end) and see how much shorter the ONT data could be and still be superior to PacBio and iCLR - which could also be seen as asking what read N50 PacBio should be aiming for to close the gap.
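The skeleton of that student project is simple; the real work would be re-mapping the trimmed reads and recomputing exon coverage at each step.  A minimal sketch of the trimming loop, operating only on read lengths (names and step size are my own invention):

```python
def n50(lengths):
    """Length at which half the total bases are in reads this long or longer."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

def progressively_trim(read_lengths, step=1000):
    """Report N50 as increasing total amounts are trimmed from each read.

    For length statistics it doesn't matter how the trim is split between
    the two ends; a real experiment would randomize the split per read
    before re-mapping.
    """
    results = []
    trim = 0
    while True:
        remaining = [L - trim for L in read_lengths if L > trim]
        if not remaining:
            break
        results.append((trim, n50(remaining)))
        trim += step
    return results
```

Running this over the ONT read length distribution would show how fast the N50 decays with trimming; the interesting endpoint is the trim level at which the 10X-coverage gene count falls to match HiFi.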

The supplemental figures have some more comparison graphs - though there isn't a plot of the read length distributions for each platform, which some will find interesting.  Much of this material is on haplotyping statistics.  Curiously, PacBio has a highly variable proportion of SNVs phased by chromosome - the violin plot is tall and skinny - whereas iCLR and ONT are more consistent, with blobbier violins.  PacBio has a better switch error rate than ONT, which in turn is better than iCLR, whereas for blockwise Hamming error it's PacBio, then iCLR, then ONT - and ONT has the largest variance.  For max phase block size, ONT wins, PacBio has about one third the length and iCLR half of that.

A table in the supplement gives iCLR stats such as N50s across all the samples - they're relatively consistent, ranging from 6,351 to 6,733.  

A running question with iCLR is how much oversequencing is required - the landmarked reads can only be assembled by having overlaps between short reads, and that's a sampling process.  This raw "into iCLR" read statistic is not given (it should be!!!), but the Materials and Methods plus Supplemental Table 2 state that two of the clinical iCLR libraries were run on one lane of a NovaSeq S4 flowcell in 2x151 mode (the others were run on HiSeq X).  NovaSeq S4 specs say 2x150 mode should generate 2400-3000 gigabases of data total (let's ignore those extra 2 cycles for simplicity! - it's less than 1% error), or 600-750 gigabases per lane.  If we call a 30X human genome 100 gigabases of data, then iCLR requires about 6-7.5-fold oversampling.  
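That back-of-envelope arithmetic written out explicitly (the 2400-3000 gigabase S4 yield and the ~100 gigabase 30X genome are the figures assumed in the text, not measured values):

```python
# One NovaSeq S4 lane = one quarter of the 2400-3000 Gb flowcell yield
s4_lane_gb = (2400 / 4, 3000 / 4)        # 600-750 Gb per lane
genome_30x_gb = 100                      # call a 30X human genome ~100 Gb

# Implied oversampling if one iCLR library consumes a full lane
oversampling = tuple(gb / genome_30x_gb for gb in s4_lane_gb)
print(oversampling)  # (6.0, 7.5)
```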

One interest in the oversampling is the cost: if HiFi on Revio is about 5X more expensive than NovaSeq X 25B per gigabase of data (~$10 vs ~$2 per gigabase), then a ~100 gigabase HiFi genome runs about $1,000 while 600-750 gigabases of iCLR data runs $1,200-$1,500 - so iCLR costs about 20-50% more than a HiFi genome for raw data generation.  Labor for library construction, capital cost, training costs, etc. are ignored here.  
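The cost comparison, using the rough per-gigabase prices assumed above (integer arithmetic to keep the percentages exact):

```python
hifi_per_gb, short_per_gb = 10, 2              # ~$10 HiFi vs ~$2 NovaSeq X 25B
hifi_genome = 100 * hifi_per_gb                # ~$1,000 for a 100 Gb HiFi genome
iclr_low = 600 * short_per_gb                  # $1,200 at 6-fold oversampling
iclr_high = 750 * short_per_gb                 # $1,500 at 7.5-fold oversampling

pct_more_low = 100 * (iclr_low - hifi_genome) // hifi_genome    # 20
pct_more_high = 100 * (iclr_high - hifi_genome) // hifi_genome  # 50
print(pct_more_low, pct_more_high)  # 20 50
```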

Rare Disease Samples

On the seven rare disease samples, one was found with iCLR to have a 16 kb duplication in a gene known to cause similar symptoms.  So that's a win!  But going back to available Nanopore and HiFi data, it could also be found there - so not a singular win for iCLR, and a reminder that tools for spotting these aberrations aren't perfect - otherwise it would have been found already.  The preprint does not explore whether increased coverage of this region is apparent in the short read data alone.

In a still gestating (a euphemism for procrastinated upon) piece on Alex Hoischen's AGBT talk on PacBio and rare disease genomes, a focus is on the varied types of difficult genetic regions that frustrate short read whole genome sequencing - and some that frustrate other technologies as well.  For example, HiFi couldn't resolve a Robertsonian fusion in that set - which is hardly surprising but still points to limits of long reads.  Almost certainly iCLR would fail on this too.  What other classes?  A zoom on what iCLR can do with painful regions such as long simple repeats would be interesting.  Or telomeres and centromeres.  Those would all make interesting additions to this preprint - though perhaps they deserve a separate piece focused on them.  And it raises the question whether there should be a GIAB-like set of difficult clinical aberrations - large expanded simple repeats, translocations and Robertsonian fusions, variants in polypurine tracts, etc. - for benchmarking both wet and dry tools.  Such samples are going to be much harder to keep anonymous, which might be a serious hindrance to very widespread distribution and usage.  But solutions have been found for sharing identifiable clinical samples, so it shouldn't be an absolute hurdle.

So iCLR is here, and maybe Illumina will get serious in pitting it against true long reads.  If PacBio and ONT are paying attention, they should now be on alert to the areas where iCLR is serious competition and where it may have weaknesses, such as the oversampling requirement.  

4 comments:

Keith Robison said...

Someone has forwarded me a much older citation on mutation marking methods for sequencing - from 2004

Anonymous said...

iCLR was most likely trained on a bunch of human cell line PacBio data to make this possible. Long and short read sequencing on a single platform would be great - it would completely eliminate the need for maps. Depends on upper and lower limit for variant size.

Anonymous said...

https://www.medrxiv.org/content/10.1101/2024.05.03.24305331v1

Anonymous said...

More expensive than HiFi and NO information on methylation.... Would also like to see de novo assemblies from the three datasets and how they compare....