The sun set during AGBT 2014 for a final time over a week ago. The posters have long been down, and perhaps the liver enzyme levels of the attendees are now down to normal as well. This year’s conference underscored a possibility that was suggested last year: that the era of the poorly connected, low quality draft genome is headed for the sunset as well.
Early complete genomes were truly complete, using a mix of libraries and other strategies to gain truly closed genomes. These were monumental achievements, and other than some minor sniping, they were heralded as such. But these were hard and expensive, and certainly a lot of biological value could be extracted from lower quality genomes, though they also laid traps for careless workers to blunder and declare artifacts as discoveries. As short read sequencing enabled fast and cheap genome sequences, quality dipped lower and lower as the cost and labor gap between draft and high quality genomes became enormous. Strategies such as mate-pair libraries tried to bridge the gap, but tended to disappoint and fall short of the goal of driving genomes to completion.
Two recent genomes illustrate what has been commonplace. The duckweed genome is 158Mb, or about the same size as Drosophila, and was assembled using Illumina paired end reads and 454 mate pairs plus BAC end sequencing. The result had a mean contig size of only 8.2Kb (contig N50 was not given) and 1071 scaffolds. The flatfish genome comes in at over 400Mb, was assembled using a gemisch of read types, and attained a contig N50 of 26.5Kb and scaffold N50 of 868Kb. That is, assuming this all came out correctly; a major revision of the Aedes aegypti genome assembly was just announced that made many edits to the scaffolds.
What has radically changed in the last year and change is the emergence of true long read sequencing, spearheaded by Pacific Biosciences. Bolstered by clever software, long reads democratized high quality microbial sequencing. With the newest chemistries and RS II instruments, a single flowcell is sufficient to obtain a completely closed genome of many bacteria of interest for less than $1K. While short read sequencing can still deliver some sort of assembly for significantly less, the distance in price has dropped to a few fold and several hundred dollars.
At AGBT, the building momentum for larger genomes was demonstrated by William McCombie (CSHL) and others. McCombie showed assemblies of S. cerevisiae and S. pombe which resolved nearly all chromosomes to single contigs; a few were represented by two contigs, one for each arm. Arabidopsis and Drosophila have shown very impressive results, and his group’s results on the over half gigabase rice genome show that high continuity assemblies are possible even in large eukaryotic genomes. Coupled with Jason Chin showing off a diploid assembler for long reads, Gene Myers announcing a much faster read cleaning pipeline, and a Japanese group presenting a poster on a read cleaning pipeline that works with less coverage, the future is very rosy for long read assembly. Moore’s Law helps the sequencing world here too: genome sizes are not doubling every year! So the hardware to process these will continue to get cheaper.
Coming down from above, Josh Burton discussed his Lachesis software, which can use Hi-C data to generate chromosome arm (or whole chromosome) scaffolds. Hi-C is a library preparation method from which the read pairs indicate regions of DNA that were physically proximate in the eukaryotic nucleus. The signal is a bit faint, but with scaffolds of 50Kb or greater Lachesis shows a strong ability to detect that signal and organize the data.
There is still a cost difference, which can be extreme for larger genomes. But as PacBio continues to improve its chemistry and preparation protocols, that gap will shrink. Right now, any genome under about 6Mb can be reasonably expected to be closed for under $1K on PacBio, though this is dependent on the quality of the DNA preparation, and care must be taken or small plasmids can be lost. PacBio library prep is more sensitive than short read preps to various contaminants, but this problem appears to be limited in impact. PacBio has announced an intention to improve throughput by 4X this year, and while that is a projection, they did meet their goals last year. So, by the end of the year it is not unreasonable to think that any bacterium and many smaller eukaryotic genomes can be had for a single SMRT cell, which is around $600 with library prep, and $10K would be enough to cover genomes up in the 150Mb range or more. Using high-coverage Illumina data to reduce the coverage requirements is another option, with a number of PacBio-with-Illumina read cleaning pipelines available. This is all without mate pair libraries, sequencing the ends of cosmids or BACs, or optical mapping. It will also be a relief for everyone who can't remember how to calculate N50s; when your contigs or scaffolds are chromosome-scale, the N50 statistic is meaningless!
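For anyone who is in that boat, here is a minimal Python sketch of the usual definition: sort the contigs (or scaffolds) from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembly length. The function name and the toy lengths are made up for illustration and are not taken from any of the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of the assembled bases
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("need at least one contig length")


# Toy, invented lengths (arbitrary units), purely for illustration.
fragmented_draft = [8, 8, 6, 5, 4, 3, 2, 2, 1, 1]
chromosome_scale = [30, 25, 20]

print(n50(fragmented_draft))   # 6
print(n50(chromosome_scale))   # 25
```

In the second case the N50 simply reports the length of one of the chromosomes, which is why the statistic stops saying much about assembly quality once the contigs are chromosome-scale.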
There is also the very real possibility of other long read technologies getting into the game. Illumina’s Moleculo technology has been used for eukaryotic assembly, though it has not been put head-to-head with PacBio and the nature of some repeats and PCR-unfriendly regions would make it unsurprising if it were not quite as good. But Moleculo may be much cheaper for large genomes, though I don’t believe the values to calculate the cost tradeoffs are widely available. Oxford Nanopore data was seen for the first time at AGBT, and while it was a bit of a lackluster debut, the ability to resolve repeats with this data was demonstrated. With Oxford’s MinION access program scheduled to get devices to users next month, and a permissive policy for data release on users’ samples, the sequencing world should soon be awash in MinION data (several Tweeters have already pledged release of their data).
So genome assembly remains in flux, but long reads are taking over and will simply get better and cheaper. Given this, isn’t it time to start pushing to make poor quality draft genomes less respectable? While there will remain hard cases, such as metagenomic samples and single cell genomes or when surveying many related isolates, the time would seem nigh to make high quality references the standard. For prokaryotes, it really is reasonable now to expect closed genomes as typical. For small eukaryotic genomes, say under 50Mb, complete chromosome arms should be the standard, and for larger genomes that should still be a goal. For larger eukaryotes, scaffolding with Lachesis and Hi-C libraries should be expected.
Such a change won't happen overnight, but it needs to start happening. Journal editors, reviewers, lab heads, staff scientists, post-docs, graduate students and everyone else who wants to lead the charge should do so. Dispelling the current complacency with low quality drafts requires active effort. But the rewards are substantial, so it is a crusade worth joining!
Jonathan Eisen Storified the Twitter commentary on this post.
I can't wait. Currently I am working on a ~65Kb genome which despite Illumina paired-end and mate-paired reads is refusing to close up. We may need more sequencing done but the PI is being cheap. :-( Anyway our first attempt with just PEs was rejected by a low-tier journal because the genome was not complete enough. Despite the added work I silently applauded. There are too many draft, or sub-draft, genomes out there.