Tuesday, March 10, 2015

A Dovetail Route to Scaffolded Genomes

10X Genomics had a lot of buzz at AGBT over their approach to acquiring long range information for complex genomes via a microfluidic-assisted library preparation scheme.  Another young company, Dovetail Genomics, is starting to unveil a very different technology with similar aims.

Dovetail's approach, available in detail via a preprint, is called Chicago, and is an evolution of Hi-C methods.  Hi-C is an approach to capturing information on the linear distances between segments of DNA by cross-linking DNA fragments that are near each other in three-dimensional space as the DNA is coiled in chromatin.  This is a clever notion and has been leveraged by a number of groups, but a drawback comes from the genesis of this approach: it was originally developed to study the three-dimensional architecture of chromatin to understand how that architecture influences biology.  So traditional Hi-C ends up mixing two signals: a signal from DNA that is close together on the same chromosome and a signal from DNA that happens to be near each other in chromatin for some biological reason.  That last category can include segments from different chromosomes, which is the last thing a scaffolding approach needs.

Dovetail's Chicago starts with several micrograms of purified DNA, so any biologically-driven associations are erased.  Chromatin is then reconstituted in vitro, followed by the Hi-C process of cross-linking DNA, digesting it with a restriction enzyme, blunting the restriction ends by end-repair, ligating fragments and then freeing those fragments to go into a standard short read library preparation protocol.  The result is a collection of fragments which is enriched for mate-pair-like sets that capture primarily long-range interactions.  Of course, within that set will also be some short-range fragments, as well as some fragments which span between different chromosomes.  Since purified DNA is input, only distances up to the fragment size will be captured, whereas Hi-C on native chromatin can capture extremely long-range information.  After sequencing a library (one lane on a HiSeq is suggested to be sufficient for mammalian genomes), software from Dovetail called HiRise applies a likelihood model to build scaffolds (and also to break input scaffolds which appear to be incorrect). Intriguingly, any DNA will do, including DNA from prokaryotes, which aren't known to wrap their DNA in this manner.
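To make the idea of a likelihood model for scaffolding a bit more concrete, here is a toy sketch in Python. It is not HiRise's actual model, which is described in the preprint and not reproduced here. It assumes, purely for illustration, that read-pair separations follow an exponential distribution with a made-up mean (`mean_sep`), that contig A is placed before contig B with a fixed gap, and that each linking pair is given as a position on A and a position on B. All function and parameter names are hypothetical.

```python
import math
from itertools import product


def implied_separation(a_pos, b_pos, len_a, len_b, flip_a, flip_b, gap):
    """Distance between the two reads of a pair if A is placed before B."""
    # Distance from the read on A to A's right-hand end in the chosen orientation.
    tail_a = a_pos if flip_a else len_a - a_pos
    # Distance from B's left-hand end to the read on B in the chosen orientation.
    head_b = len_b - b_pos if flip_b else b_pos
    return tail_a + gap + head_b


def log_likelihood(pairs, len_a, len_b, flip_a, flip_b, gap, mean_sep=20_000.0):
    """Sum of log densities under an ASSUMED exponential separation model."""
    total = 0.0
    for a_pos, b_pos in pairs:
        d = implied_separation(a_pos, b_pos, len_a, len_b, flip_a, flip_b, gap)
        # log of (1 / mean_sep) * exp(-d / mean_sep)
        total += -math.log(mean_sep) - d / mean_sep
    return total


def best_orientation(pairs, len_a, len_b, gap=1_000):
    """Return the (flip_a, flip_b) choice with the highest toy log-likelihood."""
    candidates = list(product((False, True), repeat=2))
    return max(
        candidates,
        key=lambda c: log_likelihood(pairs, len_a, len_b, c[0], c[1], gap),
    )


if __name__ == "__main__":
    # Pairs clustered near the end of A and the start of B.
    pairs = [(99_000, 500), (98_000, 1_500), (95_000, 3_000)]
    print(best_orientation(pairs, 100_000, 100_000))
```

In this toy, the orientation that puts the linked reads closest together scores best, which is the intuition behind using a distance-dependent pair density to order and orient contigs. A real scaffolder would also have to weigh contig order, gap size, noise from spurious ligations and the evidence for breaking an existing join.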

Using this, Dovetail has shown that short read assemblies can be accurately converted to very large scaffolds.  For example, an American alligator assembly with an N50 of 81Kbp was boosted to a scaffold N50 of 10.3Mbp using 210.7M reads from a Chicago library.  Based on comparison to some BAC data, they estimate the misjoin frequency to be around 1 in 8.36Mbp of assembly.  The Chicago approach coupled with the HiRise platform also showed promise in haplotyping a human sample, with 99.83% agreement to known phases for the sample.  Further examples were given in a recent talk at the 10K Genomes Meeting.
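For readers less familiar with the N50 statistic quoted above, here is a minimal Python sketch of the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name and the toy lengths are illustrative only and are not taken from the preprint.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that takes the running total to half the assembly.
        if running >= half:
            return length
    raise ValueError("n50() needs at least one non-zero length")


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 10]   # toy contig lengths, arbitrary units
    scaffolds = [10 + 6, 5 + 4, 3 + 2]  # the same sequence joined into three scaffolds
    print(n50(contigs), n50(scaffolds))
```

The toy numbers show why scaffolding moves N50 so sharply: the total amount of sequence is unchanged, but joining pieces concentrates it into fewer, longer units.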

There are a multitude of ways available to obtain long range information, with 10X and Dovetail being among the newest entrants.  Given long fragments, both of these methods can generate linkage information at ranges far longer than mate pair technologies, which mostly top out at 20Kb (or Moleculo down at 5Kb).  Lucigen has a clever method to generate mate pairs at roughly 40Kb spacing, but this involves making cosmid libraries.  With the current chemistry, obtaining lots of PacBio reads of around 30-40Kb is starting to be routine, though the cost for high coverage of a complex genome remains high, perhaps $40K for a mammal-class genome.  If PacBio can stick to their 2015 roadmap, that price might be around $10K.  Oxford Nanopore reads of greater than 40Kb have now been published, but whether such long reads can be generated routinely remains to be seen.  In theory, nanopore sequencing could sequence whatever length of DNA is given to it, so long as that DNA is free of strand breaks.

Demonstrated technologies for very long distance information include mapping methods from OpGen, BioNano Genomics and Nabsys.  However, rather than sequencing DNA, these methods identify the spacing of defined sequence landmarks, such as restriction sites (for OpGen and BioNano).  This works well with a reference sequence, and can also be used for de novo assembly of organisms.

10X and Dovetail are trying to tackle a similar space as those mapping companies, but by leveraging an Illumina sequencer.  10X uses an expensive box and fancy microfluidics; Dovetail just some molecular biology steps.  10X resolves sequencing reads to specific input fragments, which may be particularly valuable for haplotyping or for metagenomic applications.  10X also has an advantage for input material, requiring only 1 nanogram versus multiple micrograms for Dovetail.  Whether that is important or not depends on your input material: for many cancer clinical samples micrograms can be impossible to obtain, but for de novo sequencing of a creature or plant it may be no burden to obtain micrograms.

De novo sequencers may simply use both.  At the 10K Genomes Meeting, The Broad Institute's David Jaffe suggested that about $16K is sufficient to obtain a high quality vertebrate genome, using standard short input paired end libraries, with 10X and Dovetail to resolve long range structure.  Between that sort of approach and the growing number of long-read-only assemblies, perhaps the acceptability of publishing low quality draft genomes will plummet.  I tried to predict that last year; maybe this trend will be truly visible this year.

(Image credit: "Finished dovetail". Licensed under CC BY-SA 3.0 via Wikimedia Commons)


Alejandro Fernandez Woodbridge said...

Very interesting, I'm doing my PhD on 3C-based techniques and I've never been a big fan of HiC because only a very small fraction of the reads map to long distances. It is quite a twist to see HiC's greatest limitation turned into a strength in a different protocol. One thing that is not clear to me is that HiC and ChIA-PET have huge random ligation rates (ChIA-PET 40-100%) mainly because at high input it is very expensive to dilute the mix to avoid spurious ligation, I wonder how they managed to balance this...

It would be really nice if I could post this on LinkedIn, why don't you have a link to that?

Keith Robison said...

Alejandro: if you could point me to an example of the sort of LinkedIn linkage you are thinking of, I could certainly consider that - hadn't occurred to me previously. I did find an IFTTT recipe to do this to my LI profile, but perhaps you've seen something broader than that. Thanks for the real experience feedback on these methods.

David Mead said...

A new method for producing long mate pair reads approaching 100 kb is another alternative (http://lucigen.com/docs/literature/eLucidations/AppNote-Long-Span-NGS-Mate-Pairs-Approaching-100-kb-2015.pdf). Full disclosure, I work at Lucigen.