Thursday, March 30, 2017

Chromosome-Scale Scaffolds And The State of Genome Assembly

A new paper on using Hi-C sequencing appeared in Science recently, demonstrating the generation of chromosome-length scaffolds for human as well as several insect genomes.  The authors even provide a cost model, proposing that by processing multiple genomes in parallel the sequencing reagent cost (but not labor) of this approach should be about $10K per human genome. In the case of the insect genomes, the paper enables a look at chromosome evolution which is simply impossible with lower resolution.  These findings resonate with a number of pieces I've written over the years, but particularly with my recent criticism of the proposed Earth BioGenome Project and a spirited defense of that concept made in the comments of my piece by a member of the steering committee.
Hi-C sequencing approaches are perhaps the last useful form of mate pair sequencing, though they are very different from conventional mate pair sequencing.  Mate pair libraries, and their early genome project predecessors, jumping libraries, deliberately create chimaeric DNA fragments in which the goal is to create a short fragment derived from two pieces of a longer fragment.  In jumping libraries and conventional mate pairs, genomic DNA is fragmented in some way and then circularized.  The junctions formed by the circularization are fragmented out again and sequenced.  Depending on the precise strategy and conditions, the final library should have a large fraction of appropriate junctions plus some level of background of non-mate fragments from the second fragmentation.  Importantly, for correct mates the distance between the fragments should match the size distribution of the first fragmentation.

Hi-C sequencing is far more statistical.  Chemical crosslinking of DNA is used to link fragments which are close in space, since fragments close in space are likely to come from the same DNA molecule.  Hi-C can be performed on native chromatin, as in this paper, or by reconstituting chromatin in vitro as Dovetail's Chicago method does.  The Science paper has a nice little summary of the typical distribution of fragments based on the binning scheme used in their algorithm:
  • 10%  <10kb
  • 15%  10kb-100kb
  • 18%  100kb-1Mb
  • 13%  1Mb-10Mb
  • 16%  10Mb-100Mb
  •   2%  >100Mb on same chromosome
  • 21%  cross-chromosome links
So a bit of a mess, but with some really long-range information. Hi-C libraries obtain crazy physical coverage; this paper puts it at 23,000X coverage!  So using their own algorithm, which the authors benchmark against the older Lachesis software, Hi-C was capable of organizing the original assembly into chromosome-scale assemblies.  Not only did their pipeline build the original data into superscaffolds, but it also cleaned up the original assemblies by identifying misjoins and missed joins.  Not all of the input assembly participates; contigs smaller than 15kb accumulate too few Hi-C mappings to be useful.  
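To make the binning concrete, here is a minimal Python sketch of how read pairs could be sorted into the distance bins listed above.  This is only an illustration, not the authors' pipeline: the function names, the tuple layout for a read pair and the choice to put a pair of exactly 10kb into the 10kb-100kb bin are all my assumptions.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges (in bp) of the same-chromosome distance bins quoted in the post.
BIN_EDGES = [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]
BIN_LABELS = ["<10kb", "10kb-100kb", "100kb-1Mb", "1Mb-10Mb", "10Mb-100Mb", ">100Mb"]


def classify_pair(chrom_a, pos_a, chrom_b, pos_b):
    """Return the bin label for one mapped Hi-C read pair (hypothetical helper)."""
    if chrom_a != chrom_b:
        return "cross-chromosome"
    distance = abs(pos_a - pos_b)
    # bisect_right counts how many bin edges are <= distance,
    # which is exactly the index of the bin the pair falls into.
    return BIN_LABELS[bisect_right(BIN_EDGES, distance)]


def bin_fractions(pairs):
    """Fraction of read pairs per bin; pairs are (chrom_a, pos_a, chrom_b, pos_b)."""
    counts = Counter(classify_pair(*pair) for pair in pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Feeding this the mapped positions of every read pair in a library would give a table of the same shape as the one above.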

The authors included N50 statistics for the final assemblies, though as they note "The scaffold N50 for the output assemblies is not a particularly meaningful assembly statistic: It is determined almost entirely by the chromosome length scaffolds".  Indeed, I've made a similar point twice recently on Twitter: as assemblies get really good, N50 loses meaning as it is dominated by the size distribution of chromosomes or chromosome arms.
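For anyone who hasn't computed one by hand, a short Python sketch of N50 shows why it stops being informative.  The scaffold lengths below are made up for illustration (three chromosome-scale scaffolds plus a pile of small leftovers); they are not taken from the paper.

```python
def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 needs at least one sequence length")


# Hypothetical toy assembly: three chromosome-length scaffolds plus leftovers.
chromosomes = [100_000_000, 80_000_000, 60_000_000]
leftovers_a = [10_000] * 1000  # 10Mb of unplaced sequence in 10kb pieces
leftovers_b = [5_000] * 2000   # the same 10Mb chopped twice as finely

n50_a = n50(chromosomes + leftovers_a)
n50_b = n50(chromosomes + leftovers_b)
```

In both cases the answer is simply the length of the second-largest chromosome; how badly the remaining sequence is fragmented never shows up in the statistic.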

The approach was applied to three genomes: human and the mosquito disease vectors Aedes aegypti and Culex quinquefasciatus.  In all cases, the number of final large scaffolds is equal to the known number of chromosomes.  An interesting biological result that drops out of this is that the two mosquitoes share the same architecture for chromosome 3 (they have only 3 chromosomes).  Strikingly, even though A. aegypti has a greatly expanded 1.3Gb genome vs. the Anopheles gambiae genome (yet another mosquito), the genetic content of the chromosomes is essentially preserved.  A significant exception is the detection of a single breakage after the divergence from Anopheles of the lineage that led to Aedes and Culex and an exchange of one pair of chromosome arms between A. aegypti and C. quinquefasciatus.  Furthermore, the chromosome arms of the fruit fly Drosophila melanogaster map neatly with these mosquito arms, showing that flies (dipterans) have strongly conserved chromosome content.

This sort of high level chromosome organizational and evolutionary information just can't be found in highly fragmented assemblies.  Hence, I reiterate my opinion that trying to sequence every last eukaryotic species is far less valuable biologically than generating high quality assemblies of carefully selected benchmark species.  Generating draft genomes of every last dipteran is far less likely to be biologically instructive than generating chromosome-arm-resolved genomes of each arthropod family.  Given the expense required to ethically and definitively obtain high quality biological material, investing in generating high quality genomes for a careful sampling of species diversity will (in my opinion) be much more enlightening than crude gene catalogs for orders of magnitude more species.

Note that these mosquito genomes are not anything like platinum genomes, but were scaffolded from apparently old assemblies (the human assembly was fresh 2x250 Illumina data).  It will be interesting to see how this plays out.  For example, the ability of this pipeline to detect assembly errors and missed joins would drift towards irrelevancy if the input assemblies were of much higher quality.  The assemblies used had contig N50s of 103Kb, 83Kb and 29Kb for human, Aedes and Culex respectively.  In contrast, the recent nanopore assembly of a human genome had a contig N50 approaching 3Mb, and PacBio hit a 4Mb N50 for human three years ago and has continued to push the envelope on assembly contiguity.  Back in fall 2015, PacBio hit a contig N50 of 26.9Mb on a haploid human sample, with half the genome in the 30 largest contigs.  The image below is from PacBio's press release on this, showing chromosomes 2 and 6 as a small number of contigs.

Better input assemblies solve a number of issues.  First, there are those short contigs the Science authors filtered out, constituting about 5% of the input genome.  There's also a large number of "short" and "tiny" scaffolds left over by the process; presumably some of these represent contaminants and other schmutz, but some may be unplaced real information.  Unfortunately, no debugging work appears to have been performed on these.  The final Aedes assembly also misplaces (or, the authors suggest, fails to have corrected) 4 known linkage markers out of 1826 tested, which is very good but frustrating if your area of interest is tagged with one of those markers.  The Culex map has only one misplacement, but a much smaller number of input markers.

So that opens up the question of how much boost Hi-C could give to really good assemblies and whether that improvement is worth the cost.  In other words, if it costs $10K to improve your human genome from 50% of the assembly in 30 contigs to 99% of the data in 23 scaffolds, is that better than spending that money to sequence another human (or other mammalian) genome to the 50% in 30 range?

The State of Genome Assembly, March 2017

For high quality de novo genomes, the technology options appear to be converging for the moment on six basic technologies which can be mixed-and-matched.
  • Hi-C (in vitro or in vivo)
  • Rapid Physical Maps (BioNano Genomics)
  • Linked Reads (10X, iGenomX)
  • Oxford Nanopore
  • Pacific Biosciences
  • vanilla Illumina paired end
Please correct any inadvertent omissions, but I'll note some deliberate ones.  I have never been a fan of mate pair approaches, mostly because they've never worked well on my problems.  Even if that is just bad luck for me, it is hard to see the logic of 20kb mate pair libraries in a world of abundant long reads much longer than that.  I've left Nabsys off the mapping space as I haven't seen them working on de novo, but perhaps I've missed that.  I've seen people clean up long read data with Ion Torrent, but that always seems odd given the cost and accuracy advantage of Illumina.

If you work in prokaryote space, then the solution space is really Pacific Biosciences as the clear leader, with the option to polish with Illumina reads, and Oxford Nanopore as the upstart.  PacBio has the edge right now due to being able to go solo; Oxford's sequence quality demands Illumina polishing.

For small eukaryotic genomes, similar rules apply.  Oxford has the advantage that with the recent "leviathan" reads approaching a megabase it is possible to have centromere-spanning and even chromosome-length reads.  For example, in Saccharomyces cerevisiae 10 of 16 chromosomes are shorter than the longest reported mappable MinION read.  At the moment leviathans are rare, which gives an opening for BioNano, but this area is a hotbed right now so don't be surprised to see rapid improvement in the length and yield of monster reads. 

But in many eukaryotic organisms, at least some centromeres become too long to span with MinION leviathans or BioNano maps.  For example, human centromeres range from 0.3Mb to 5Mb in size, so the lower range could be tackled with super-long reads but the high end would still be out-of-reach.  Whether stopping at chromosome arms is only disappointing or a serious problem would depend on the organism and project goals, but certainly for poorly-studied organisms in the more remote reaches of the tree of life it could be a significant handicap.  But in any case, it really should be possible to generate high-quality eukaryotic genomes for perhaps $40K, with $10K or less going for long read sequencing, $10K for Hi-C and another $20K for labor (and I have no idea how to estimate compute costs for Hi-C).  The labor is a total guess, though I think the going estimate for fully loaded personnel in the Boston area is $400K a year, or about $2K/day (assuming 200 work days a year, which isn't far off the mark for this exercise), so that's saying 10 days of allotted human effort to extract DNA, perform Hi-C and do basic informatics.
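The back-of-envelope arithmetic behind that $40K figure can be written out as a few lines of Python.  The dollar amounts are the guesses from the paragraph above; the function and parameter names are just mine for illustration, and compute costs are left out because I have no estimate for them.

```python
def genome_cost_estimate(
    long_read_cost=10_000,         # USD, "or less" per the estimate above
    hic_cost=10_000,               # USD, Hi-C budget used in the estimate above
    loaded_cost_per_year=400_000,  # USD, guessed fully loaded personnel cost
    work_days_per_year=200,        # assumed working days per year
    labor_days=10,                 # days for DNA extraction, Hi-C, basic informatics
):
    """Return (daily_rate, labor_cost, total) for the rough per-genome budget."""
    daily_rate = loaded_cost_per_year / work_days_per_year
    labor_cost = daily_rate * labor_days
    total = long_read_cost + hic_cost + labor_cost
    return daily_rate, labor_cost, total


daily_rate, labor_cost, total = genome_cost_estimate()
print(f"${daily_rate:,.0f}/day, ${labor_cost:,.0f} labor, ${total:,.0f} total")
```

Changing any one guess (say, halving the labor days or swapping in a different Hi-C cost) shows quickly which term dominates the total.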

A related problem in eukaryotes is resolving haplotypes, a particularly great challenge in highly polyploid organisms.  The Science paper does not generate haplotype-resolved versions of the scaffolds.  My guess is that this isn't easy to do with Hi-C data and is best left to linked reads or long reads, but perhaps someone who works in the space can weigh in.

In any case, if someone were crazy enough to put me in charge of managing a large zoo-seq project, I'd start everything off with just long reads and treat linked reads, physical mapping and Hi-C as optional add-ons to be applied if the initial assembly didn't fully gel or if haplotyping were critical.  Linked reads might also be deployed to less expensively ascertain population variation with haplotype information.  Linked reads are well established with small amounts of input DNA, though the long read vendors are constantly pushing on that front.

The field will continue to evolve at a dizzying pace.  Existing players are all trying to up their game, delivering higher performance.  Accessory technologies will continue to develop, such as methods and instruments for extracting extremely high molecular weight DNA to go into long read sequencing and physical mapping workflows. New entrants are poised to enter the fray, such as new linked read options or long read options.  Never a boring moment in this space!


Anonymous said...

Hey Keith,

Thank you so much for writing such a thoughtful post about our paper!

I did want to clarify one point. In a few places you say that the Hi-C sequencing costs $10K. Actually, the $10K value we report is the total amount for a de novo human genome, including both the contigs (which are the lion's share of the cost: 60X coverage, Illumina PE 250bp) and the Hi-C data used for scaffolding (<7X Illumina PE101bp). The Hi-C cost alone - including library prep and labor - is less than $1K.

Thus the contigs are the expensive part, and when contigs are already available (whether generated using Illumina, PacBio, or any other method), the costs can be much lower! In the case of the Aedes genome, for which contigs were available from the 2007 Science publication (Nene et al.), we used only $1K worth of Hi-C sequence for the scaffolding. Similarly, Culex was also about $1K worth of sequencing.

These costs are broken down in supplemental table S12. They can vary based on factors like contig length, genome length, and genome repeat structure, and will drop as the cost of short reads continues to fall.


Anonymous said...

Hi Olga,

these costs can only be matched if you're in academia, have extensive lab knowledge and don't account for the time (and salary) a bioinformatician has to spend performing the analysis. Outside academia and when you actually have to pay the full cost for DNA extraction, sequencing, HiC scaffolding etc. it's about 4x as much.