Omics! Omics!: DoveTail Transposes Their Hi-C Methodology

Technologies vying for state-of-the-art in human genome analysis are a recurrent theme in this space, and there are many ideas on this in the collection I really need to get out over the next two weeks before my brain is overwhelmed by London Calling. Up today: Dovetail Genomics popping back on the scene (as a subsidiary of Cantata Bio) with an AACR poster several weeks ago showing early results from a "LinkPrep" kit that will commercialize tagmentation (in vitro transposition to fragment DNA and add adapters) for Hi-C library generation, with the promise of enabling short read sequencing to deliver both SNVs and long-range structural information from the same library.

I believe it was Kevin McKernan who commented via Twitter on my Illumina Complete Long Reads piece that it illustrated the power of transposase technology for genomics. Being able to manipulate in vitro transposition has yielded many dividends, and illustrates the power of basic research with no obvious applications to yield amazing ones - when William Reznikoff was determining the DNA sequence requirements for Tn5 transposition in the early 1980s, any sort of DNA sequencing was still in its infancy. These days Illumina's Nextera and various seqWell products yield standard short read libraries, while ONT's Rapid Kits enable long read sequencing. seqWell announced at AGBT a transposon-based shearing-and-tagging "LongPlex" scheme to go upstream of PacBio library prep, and there is at least one academic report of using transposition to construct HiFi libraries. ATAC-Seq for chromatin profiling relies on in vitro transposition; so does Universal Sequencing Technology's TELL-Seq linked read technology as well as Complete Genomics' similar single-tube Long Fragment Read (stLFR) approach. Confession: I probably spend a few hours every week attempting to come up with clever new ways to apply transposons to sequencing. And I've hardly done justice above to all the successful attempts - and I won't claim to even know all of the clever adaptations.

Here's a funny but relevant story (though you'll have to wait through a bunch of text to see the relevance) about my first practical experience with tagmentation. Warp was getting fired up and we had a small batch of Streptomyces genomes to sequence. One of the local academic core labs was willing to run our samples and try the relatively new Nextera kits from Illumina, which had recently acquired Epicentre, who had commercialized the tech - one of the most strategic acquisitions in Illumina's history. So we submitted DNA and waited the 6 or so weeks for results. I eagerly tried to assemble the data - and got zip. One sample had partial data available in Genbank, the ~90 kilobase gene cluster that makes the compound FK506. So I mapped the reads to it - and discovered the problem as well as my naivete on processing reads. We had reads that mapped - about 45 bases after removing sequencing adapters! That's the length you get if the transposons are in berserker mode and pile up next to each other - 45 bases is the distance between cleavage sites if two transposases are touching each other. Oopsie - the quantitation of our DNA had clearly failed and the transposon-to-DNA ratio was too high. Bead-linked transposition in the current Nextera kits really, really helps with this - we've run the studies ourselves - but it is a serious issue. seqWell's kits also are resistant to over-tagmentation, using two rounds of tagmentation to install the two different sequencing adapters for standard libraries.

Back to Hi-C. Hi-C was once known as Chromosome Conformation Capture sequencing, but Hi-C was less of a mouthful and evoked in us Gen-Xers memories of a sugary drink. Hi-C is also an amazing technology. It has been commercialized by Phase Genomics, Arima Genomics, Dovetail and QIAGEN - and I won't claim that's a complete list of suppliers. Oxford Nanopore developed a Hi-C-like protocol, Pore-C, adapted for their platform. Several companies are trying to push Hi-C into the clinic - many of these kit providers plus the dedicated company Enhanc3d Genomics, which had a presentation by ONT alum Dan Turner at AGBT.

Hi-C protocols start by fixing cells with a crosslinking agent - no extracted DNA please! Well, there was the "CHicAGO" technology that Dovetail was originally founded on, in which chromatin is assembled in vitro and then crosslinked. Either way, you now have DNA and RNA linked to protein as a big snarl that doesn't come apart and preserves much of the structure - at least in terms of proximity - of the chromatin before fixation. Extract that chromatin, and you're ready for the next step.

The key idea in Hi-C is that if the DNA is cleaved and then re-ligated, at some appreciable frequency the ligation junctions won't be just putting ends back together that you just cleaved, but instead an end will latch onto some other neighboring DNA. After a bit more processing, such as reversing crosslinks, pulling down junctions (somewhere you've tagged them with biotin or such), adding sequencing adapters and indices and then sequencing, you have a dataset that has all sorts of interesting read pairs in it. If you map these to a reference and then compute correlations between every point X and point Y in the genome in terms of how often they are linked, you get a complex correlation map which has many layers of information in it. The strongest signal is proximity along the same DNA molecule, with roughly an exponential drop in signal as you get farther. Fainter are looping and chromosome pairing information that give three-dimensional organization information. And a lack of connections can be interesting too - Phase Genomics has shown that plasmids and phage can be linked to hosts, and multiple chromosomes linked to each other, since they will have connections in Hi-C maps derived from metagenomes whereas DNAs from different cells will not be linked.
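To make the "correlation map" idea concrete, here is a minimal Python sketch of the binning step, assuming read pairs have already been mapped to positions on a single chromosome. The function names, the toy coordinates and the bin size are all made up for illustration; real Hi-C pipelines also filter, deduplicate and normalize, none of which is attempted here.

```python
def contact_matrix(pairs, genome_length, bin_size):
    """Count read pairs falling into each (bin_x, bin_y) cell of a symmetric map."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Mirror off-diagonal contacts so the map is symmetric.
            matrix[j][i] += 1
    return matrix


def contacts_by_distance(matrix):
    """Sum contacts along each diagonal: entry d covers bins that are d bins apart."""
    n = len(matrix)
    return [sum(matrix[i][i + d] for i in range(n - d)) for d in range(n)]


# Toy data (hypothetical): 0-based positions of the two ends of each read pair.
pairs = [(5, 12), (8, 95), (40, 47), (41, 60), (90, 99), (10, 15)]
m = contact_matrix(pairs, genome_length=100, bin_size=25)
decay = contacts_by_distance(m)
```

Summing along the diagonals is one simple way to look at the drop in signal with distance; off-diagonal cells that stand out against that background are where the looping and rearrangement signals live.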

Net result, Hi-C can give long-range information useful for scaffolding genome assemblies. And if you already "know" what that scaffold should look like and a sample gives a very different result, that suggests a large structural change such as a translocation, inversion, deletion, duplication and so forth. And generating such information doesn't require crazy amounts of coverage, though the more Hi-C coverage the smaller the perturbation you can hope to find. High coverage detects three-dimensional structure better, and the more coverage the higher the resolution - which leads researchers like Erez Lieberman Aiden to collect 10,000X or more coverage of human genomes, as he presented at the Ultima Napa Valley launch event.

How to do the cleavage? In most kits and papers, the solution has been to use a cocktail of frequently-cutting restriction enzymes and digest essentially to completion - some cross-links from the fixation may prevent digestion at a small number of sites. This cocktail is potentially tunable to your organism of interest. And since it's digestion to completion(ish), no worries about titrating the amount of digestion reagent to the amount of chromatin.

The catch of course is that the pattern of restriction sites might be problematic. I took a model organism genome of interest and computed the distance between cut sites for a commercial Hi-C restriction enzyme cocktail - the red part of the histogram is fragments of 20 bases or less, which will likely be unmappable. And in the context of Dovetail's intent with the new kit there's an additional issue.

Dovetail wants to have one kit which can claim finding both genetic variation and structural variation using short reads - so all those red fragments are unreadable. If a medically important variant happens to be sandwiched between two closely spaced restriction sites, it can't be read reliably. Conversely, even in this ~12Mb model genome we see fragments longer than 300 bases - and so there will be regions of the genome blind to a 2x150 sequencing chemistry. All of this just gets worse in a large genome.
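The fragment-size calculation is easy to sketch in Python. This is not the script behind the histogram, just a minimal version under stated assumptions: the recognition sites (GATC and GANTC) are illustrative stand-ins rather than any particular vendor's cocktail, each enzyme is treated as cutting at the first base of its site, and the 20 and 300 base thresholds are the ones discussed above.

```python
import re


def fragment_lengths(sequence, site_patterns):
    """Return fragment lengths from a complete digest of a linear sequence.

    Simplification: every enzyme is treated as cutting at the first base of its site.
    """
    cuts = set()
    for pattern in site_patterns:
        # A lookahead finds overlapping sites as well as non-overlapping ones.
        for match in re.finditer(f"(?={pattern})", sequence):
            cuts.add(match.start())
    boundaries = sorted(cuts | {0, len(sequence)})
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def classify(lengths, too_short=20, too_long=300):
    """Tally the bases sitting in too-short, readable and too-long fragments."""
    tally = {"short": 0, "ok": 0, "long": 0}
    for n in lengths:
        if n <= too_short:
            tally["short"] += n
        elif n > too_long:
            tally["long"] += n
        else:
            tally["ok"] += n
    return tally


# Toy sequence (hypothetical): one GATC site and one GANTC site in a run of A's.
toy = "A" * 30 + "GATC" + "A" * 10 + "GAATC" + "A" * 350
sites = ["GATC", "GA[ACGT]TC"]  # illustrative sites, not a specific commercial cocktail
lengths = fragment_lengths(toy, sites)
summary = classify(lengths)
```

Run on a real genome FASTA, one chromosome at a time, the "short" and "long" tallies give a rough estimate of how many bases fall in fragments that are too short to map or too long for 2x150 reads to cover fully.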

One alternative to restriction enzyme cocktails is to instead use a non-specific nuclease (when using micrococcal nuclease the approach is often called Micro-C). That should eliminate the problem of regions with too closely clustered restriction sites - but returns the problem of how to avoid overdigestion.

Dovetail is instead using transposases, which have limited sequence specificity. Dovetail is claiming LinkPrep libraries can be generated in one slightly-long day, though how much shorter LinkPrep's protocol is than restriction-based competition won't be clear until full protocols are released. Improvements in coverage of a human reference genome are plotted below.

In terms of sample input, at the moment Dovetail is saying fresh and frozen tissue. For clinical applications, being compatible with molecular pathologists' great frenemy, Formalin Fixed Paraffin Embedded (FFPE) tissue, is essential. Since fixation with formalin is the first step of Hi-C, many commercial Hi-C kits are compatible with FFPE, so Dovetail is working on this. An interesting side effect of using chromatin as input is that excess transposition - as in my tale of woe - is restrained by the presence of the chromatin proteins and crosslinks. Whereas my awful results were from transposons basically cramming in next to each other, with crosslinked chromatin much of the DNA is sequestered.

Looking at the poster, Dovetail makes several performance claims, many of them around using the kit with >80X coverage to detect three-dimensional chromatin structure. For example, even my untutored eyes can spot the difference between the top contact map (TopoLink) and the more typical restriction enzyme schemes below it - note the greater intensity in particular of the triangles at the left side as well as the big triangle structure (the fractal-like character of these can really suck me in!) near the middle.

The plot below shows the extreme sensitivity of LinkPrep data for translocations. BCR-ABL is perhaps the best known human translocation, creating "The Philadelphia Chromosome", important in many ways to the history of oncology, genomics and pharmacology. In conjunction with Twist, Dovetail is designing hybridization capture panels for rearrangements of high medical interest; in a proof-of-concept, Dovetail shows that even tiny numbers of reads from capture can resolve BCR-ABL - 1M reads gives huge separation of the signal between BCR-ABL and a negative control, but realistically there's serious signal at 211K reads. So such an assay could easily be read out on an iSeq!
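The basic signal behind such an assay can be sketched as counting read pairs that bridge the two partner loci and comparing that count to a negative control. What follows is a minimal Python sketch, not Dovetail's analysis: the function name is hypothetical, and the region coordinates and read pairs are toy values chosen only to exercise the code (only the chromosome assignments, ABL1 on chr9 and BCR on chr22, are real).

```python
def count_bridging_pairs(pairs, region_a, region_b):
    """Count read pairs with one end in region_a and the other end in region_b.

    Reads are (chrom, pos); regions are (chrom, start, end), half-open.
    """
    def inside(read, region):
        chrom, pos = read
        region_chrom, start, end = region
        return chrom == region_chrom and start <= pos < end

    return sum(
        1
        for first, second in pairs
        if (inside(first, region_a) and inside(second, region_b))
        or (inside(first, region_b) and inside(second, region_a))
    )


# Toy regions (hypothetical coordinates, not real gene positions).
abl_region = ("chr9", 1000, 2000)
bcr_region = ("chr22", 500, 1500)

# Toy read pairs (hypothetical): a translocation sample and a negative control.
sample = [
    (("chr9", 1500), ("chr22", 700)),   # bridges the two regions
    (("chr22", 1400), ("chr9", 1999)),  # bridges, ends in the other order
    (("chr9", 1500), ("chr9", 1600)),   # both ends in the same region
    (("chr9", 2000), ("chr22", 700)),   # first end just outside the region
]
control = [
    (("chr9", 1500), ("chr9", 1600)),
    (("chr22", 600), ("chr22", 900)),
]

sample_hits = count_bridging_pairs(sample, abl_region, bcr_region)
control_hits = count_bridging_pairs(control, abl_region, bcr_region)
```

In practice the counts would need normalizing by sequencing depth before comparing sample to control, since the whole point of the poster figure is how few reads are needed to see the separation.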

The plot below shows BCR-ABL again on the left, but this time from a very high coverage library that begins to reveal chromatin architecture and thereby epigenetic features. The control data isn't shown, but the chromatin loops that are novel to the translocation are annotated with blue arcs. Potentially a new frontier of clinical/diagnostic sample interrogation.

What would be nice to see? Obviously, FFPE compatibility. I'd also like to see more validation data with other aberrations. For example, long ago I worked on EML4-ALK fusions in lung cancer, and those are a 12 megabase inversion and are clinically actionable. I've not worked with Hi-C, but that seems like a more challenging target than a full translocation like BCR-ABL. It would also be interesting to see the sensitivity for these in libraries without hybrid capture - hypothesis-free searches. This again points to the need in the community for a GIAB-like panel of chromosomal aberration cell lines covering many of these interesting categories.

Of course, the biggest step is full launch of LinkPrep so that independent labs can start generating data and their own comparisons. Dovetail appears to be trying to pitch LinkPrep as an alternative to true Long Reads or iCLR for clinical genome sequencing, but how well does it stack up head-to-head with those? One question is how much of the genome is lost to mapping due to having only one fragment that maps to a region rather than paired ends (actually, measuring this effect would be a cool student project that I might expand upon) - and that's atop the general troubles short reads have with mapping uniquely into repetitive sequence. Plus, losing most of the conventional paired-end information (some fragments do just find their original partner) will hinder approaches for measuring simple repeat arrays by looking at the distribution of library fragment sizes. Dovetail didn't show any haplotyping data here - Hi-C can find correlations between same chromatid vs. sister chromatid variation and gain some phasing, but phasing would be another place the other technologies excel. On the other hand, Hi-C is probably better able to detect rearrangements involving highly repetitive regions.

I'm going to push hard to get yet another look at clinical human genome sequencing methods out next week. There are so many approaches, each with a complex set of advantages and disadvantages, and innovations such as LinkPrep continue to arrive at an intense pace. I hope you'll continue to join me on this Red Queen race to clinical genomics excellence.


Sunday, May 12, 2024
