As a quick aside, I've known Alex for a number of years now. He's been going to AGBT far longer and more regularly than I have, and he has an absolutely charming personality but can deploy some wicked sarcasm about vendor claims. He's also spoken about projects in his lab sponsored by PacBio and BioNano Genomics, and moderates excitement about new technology with realism about the challenges of clinical genetics. When my tweets last year revealed I was traveling in the Netherlands, he invited me to a lab tour at his facility at Radboudumc in Nijmegen - on a Sunday! Not only was it full of cool genomics tech, but it also got me the critical perk of a Dutch cycle ride - my hope of two-wheeled cruising around the Arnhem battlefield was a bike shop too far, due to the day of the week. But he could unlock a railroad station share bike for me (you have to live there to get a membership).
Glimpses of the Current Multitechnology Menagerie
Okay, back to the talk. To set the table, Hoischen reminded us attendees that there are over 7000 rare diseases, with 1 in 17 individuals being affected during their lifetime. About three quarters of rare disease manifests during childhood, bringing great misery on the most innocent and vulnerable. About 80% of rare disease appears to be genetic, but the cause of disease is identified in only 30-50% of patients - so there is a huge gap between what we should be able to diagnose and what we can diagnose.
A recent publication from a different group at Radboudumc has some nice figures illustrating the ensemble of different technologies that has developed around this. There's Sanger sequencing, Molecular Inversion Probes, PCR for specific deletions, multiplex ligation-dependent probe amplification (MLPA), Southern blots, conventional karyotyping, Fluorescent In Situ Hybridization (FISH), microarrays, short read panels, long read amplicon sequencing, short read exome sequencing, methylation arrays, optical genome mapping - and probably a kitchen drawer full of more specialized methods. I was going to emphasize the age of Southern blotting, but karyotyping is even older. And FISH ain't a spring chicken either. Many of these require special expertise - and perhaps some luck - to interpret. On one Infinity project, FISH was being used to diagnose a 12 megabase inversion - and too often the chromosomes flopped over on the slide so as to make the "split FISH" effect very ambiguous.
The paper asks the question: if short read WGS is used as the first line of analysis, how much of what the other methods can detect would be missed? The pie charts below show the wide range - Southern blot and exome sequencing become superfluous, a bit remains for Sanger and MLPA, and karyotyping and FISH aren't really replaced at all.
Proofing HiFi on Rare Genetic Disease Genomes
Hoischen posed the question: can PacBio HiFi WGS replace them all? So the Radboud will be generating 500 genomes using HiFi, with a target of 30X coverage using one Revio flowcell per patient, in collaboration with PacBio. As a pilot, 100 mutation-positive samples - so positive controls, but with mutations hard to call from short read data - were chosen. The next round adds 100 samples that weren't diagnosable using whole exome sequencing despite having data for both the affected individual and both parents (trios), plus 100 negative singleton samples (no parental data available).
In actuality, a mean HiFi yield of 93.8 gigabases - 29.2-fold coverage - was achieved. Each sample had seen a battery of the other tests, but not every sample had seen every older test. For example, in the 100 positive control set the most used prior modality was Sanger, and that covered less than a third of the samples. Variant types were, well, varied - over half having SVs, about a third with single nucleotide variants, about a quarter with causative short tandem repeat alleles, 9 with indels <50 basepairs, and a handful with methylation defects or regions of homozygosity. Better yet, the 100 positive controls harbor 128 variants of interest.
HiFi sequencing detected 95% of the known pathogenic variants - 87% by automated calling and the remaining 8% requiring manual curation or manual inspection of reads. So not perfect overall, and also some clear room for improved informatics - but very impressive! And some of that informatics improvement is already underway, including moving to T2T reference genomes. Most of the misses were either in long A,G-rich regions or were extremely large cytogenetic aberrations such as Robertsonian fusions.
For the test set of unsolved cases, 290 of the 400 (300 from trios plus 100 singletons) were complete at the time of AGBT. While the mean coverage was similar to before - 28.5-fold coverage aka 92.7 gigabases on average - some of the most recent runs have hit over 120 gigabases on a flowcell. If those yields can become standard, then it may be possible to load more than one genome per flowcell - Hoischen mentioned two per flowcell and going for only 20X coverage, but this armchair observer would consider spreading 6 genomes across 4 flowcells (see the back-of-envelope sketch below) - once you've gone to barcoding your samples a lot of flexibility is opened up - if your informatics can be that flexible.
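To make that concrete, here is a minimal back-of-envelope sketch in Python, assuming a ~3.2 gigabase human genome and the per-flowcell yields quoted above; the pooling scenarios and the arithmetic are this armchair observer's, not anything Hoischen presented.

```python
# Back-of-envelope coverage arithmetic, assuming a ~3.2 Gb human genome
# and the per-flowcell yields quoted above. Numbers are purely illustrative.

GENOME_GB = 3.2  # approximate human genome size in gigabases (an assumption)

def fold_coverage(total_yield_gb: float, n_genomes: int = 1) -> float:
    """Per-genome fold coverage when barcoded genomes share a yield equally."""
    return total_yield_gb / n_genomes / GENOME_GB

# One genome per Revio flowcell, roughly as achieved so far (~93 Gb)
print(f"1 genome on a 93 Gb flowcell:      {fold_coverage(93):.0f}x")

# Hoischen's suggestion: two barcoded genomes sharing a 120 Gb flowcell
print(f"2 genomes on a 120 Gb flowcell:    {fold_coverage(120, 2):.0f}x each")

# The armchair alternative: six genomes spread across four 120 Gb flowcells
print(f"6 genomes on 4 x 120 Gb flowcells: {fold_coverage(4 * 120, 6):.0f}x each")
```

Under those assumptions the two-per-flowcell scenario lands just under 20X per genome, while six genomes across four flowcells buys back roughly 25X each - the sort of trade-off that only works once barcoded pooling and the downstream informatics treat the flowcell as a shared resource rather than a one-sample unit.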
The analysis of that cohort is still ongoing. But in other cohorts with severe intellectual disability, even 10-12X coverage was frequently successful at uncovering candidate disease alleles - including breakend (BND) calls in one trio suggesting a novel rearrangement on chromosome 2.
In another cohort of 293 (out of a planned 500) genomes - many from trios but others being used for families with more than one affected individual - which are "WES and/or WGS negative" for difficult ("unsolvable") genetic syndromes, 10X HiFi coverage from a single Sequel IIe flowcell has already found causal variants in 11% of the families and candidates in 3% more. Every solve here is a win for a family, but the low yield emphasizes how difficult these diseases are.
It's also worth noting the additional value the long reads bring, even when short reads had suggested causative mutations. In a recent preprint on the low coverage PacBio work for the SOLVE-RD consortium, several examples were given of improved information from HiFi. For example, in one case parental data was not available but data from unaffected siblings was; this enabled the conclusive deduction that a candidate causative intragenic deletion in a proteasome subunit was a de novo mutation in the affected individual - the same haplotype, lacking the deletion, could be found in the unaffected siblings. In another case, short read sequencing suggested that a mutation was present in titin, but evaluating its impact requires knowing the parent-of-origin, which could not be determined from the short read data - HiFi succeeded in phasing the deletion to the paternal allele, which would be consistent with a pathogenic role. In another patient, HiFi data resolved the configuration of a gene duplication as tandem. The shot below is of a case where an entire gene has been duplicated and the copy has then replaced a gene on another chromosome.
So that's a strong argument for PacBio HiFi as the first line of whole genome sequencing for rare disease solving. The cost of running and interpreting the existing zoo of specialized tests is something amateurs such as this scribe haven't considered when comparing the cost of a HiFi whole genome to a short read whole genome. Of course, it may be that clinicians will still insist on running Sanger or MLPA or other older tests for confirmatory purposes, until they can be convinced that HiFi can be trusted. Conversely, I remember Alex kvetching about how many false positives popped out of short read exomes or genomes that then burned resources being ruled out by other assays.
Solving The Rest
But what about the residual? The A,G-rich regions might be amenable to improvement by protein engineering on the polymerase - Hoischen thinks the polymerase is basically falling off in these, as the "didn't rise to HiFi due to too few passes" bin appears to be enriched with these purine-rich runs. Or perhaps better informatics can leverage these sub-HiFi reads to good advantage. For example, in one case Hoischen found that not-quite-HiFi reads could be leveraged to identify a pathogenic short tandem repeat expansion. I also got the impression from him that no current technology does well with these regions - that's worth testing more extensively but is likely true.
The SOLVE-RD preprint also points out an interpretational challenge: while good catalogs of SNP variation with annotation of allele frequencies and evidence for phenotypic effect are now available, similar databases of structural variation are far behind.
On the big chromosomal aberrations, HiFi may just not be the right technology. PacBio could try to push HiFi read lengths, but probably they need to be much longer. PacBio could backtrack and promote Continuous Long Reads and try to get truly long insert libraries going again, but that's fighting a huge amount of messaging as well as technical challenge - and a lot of additional expense.
One option is to backstop HiFi with genome mapping. BioNano is the existing player in this space and continues to improve their system, but as I noted before the company is having difficulty staying afloat. And they may have direct competition: Nabsys had one of the small, more distant suites at AGBT and is moving towards launch, perhaps sometime this year. Hi-C is another option for large aberrations, as I noted previously. Arima Genomics, Phase Genomics and Dovetail Genomics are all vying for this slice of the pie.
Oxford Nanopore is another possibility - though of course ONT won't be satisfied with just cleaning up large chromosomal aberrations for rival PacBio. As ONT advances in their claims of high accuracy - and as always those should be treated with caution - the possibility of ONT providing a truly comprehensive solution will be there. But that will require demonstrating this utility across cohorts of similar size and difficulty to the PacBio studies described above - and what weak spots will be revealed by such a study? The agenda for next week's London Calling has multiple sessions on rare diseases, so I hope to have a better read on ONT's status here after that.
Solving the unsolvable may also require bringing more 'omics to the table. One advantage of the long read technologies is the inherent ability to detect methylation and even phase it; one case Hoischen cited showed aberrant methylation on one chromosome.
Applying long read chromatin assays, which I recently covered, may be one approach that becomes more routine - particularly since, in contrast to short read ATAC-Seq, these will often be phased. Ultra-deep RNA-Seq - I mentioned Pengfei Liu's work in my Ultima piece - is another approach, though it may require development of robust iPSC technologies to generate patient-specific simulations of tissues (such as brain) which can't be ethically accessed. But what else? Clinical geneticists in this space already deal with genetic mosaicism - what other "this wasn't in the first year genetics textbook" systematic phenomena remain to be discovered?
I wonder how many of the remaining undiagnosed diseases are evident in the sequencing data but the genetic signature hasn't been worked out yet...