Monday, June 29, 2009

Lox: The last genome for electrophoretic Sanger?

Amongst the news last week is a bit of a surprise: the salmon genome project is choosing Sanger sequencing for the first phase of the project. Alas, one needs a premium subscription to In Sequence, which I lack, so I can't read the full article. But the group has published (open access) a pilot study on some BACs, which concluded that 454 sequencing couldn't resolve a bunch of the sequence, so shorter-read technologies are presumably ruled out as well. A goal of the project is a reference sequence good enough to serve as a benchmark for related fish, which demands very high quality.

This announcement is a jolt for anyone who has concluded that Sanger has been largely put out to pasture, confined to niches such as verifying clones and low-throughput projects. Despite the gaudy throughput of the next-gen sequencers, read length remains a problem. However, that hasn't stopped de novo assembly projects such as panda from apparently moving ahead. Salmon, it seems, is even nastier when it comes to repeats.

Since I'm still playing the armchair next-gen sequencer (for the moment!), this makes an interesting gedanken experiment. Suppose you had a rough genome that you really, really wanted to turn into a high-quality reference sequence. On the one hand, Sanger sequencing is very well proven. However, it is also more expensive per base than the newer technologies. Furthermore, Sanger is pretty much a mature technology, with little investment in further improvement. This is in contrast to next-gen platforms, which are being pushed harder and harder by both the manufacturers and the more adventurous users. This includes novel sequencing protocols to address difficult DNA, such as the recently published Long March technique (which I'm still wrapping my head around) that generates nested libraries for next-gen sequencing using a serial Type IIS digestion scheme. Complete Genomics has some trick for inserting multiple priming sites per circular DNA template. Plus, Pacific Biosciences has demonstrated really long reads in a next-gen platform -- but demonstrating is different from having it in production.

So it boils down to the key question: do you spend your resources on the tried-and-true but potentially pricey approach, or do you bet that emerging techniques and technologies can deliver the goods soon enough? Put another way, how critical is a high-quality reference sequence? Perhaps it would be better to generate very piecemeal drafts of multiple species now and then go for finishing the genomes when the new technologies come on line. But what experiments dependent on that high-quality reference would be put off a few years? And what if the new technologies don't deliver, in which case you must fall back on Sanger and be quite a bit behind schedule?

It's not an easy call. Will salmon be the last Sanger genome? It all depends on whether the new approaches and platforms can really deliver -- and on whether someone is daring enough to try them on a really challenging genome.


Rick said...

As you say, it will take someone brave enough to do de-novo genome assembly with 2nd gen technologies to prove that it can be done. Part of the article you cite goes on to say, "The ICSASG will likely also pay close attention to the cod genome project, a collaboration ...that was established last year to sequence the cod genome de novo, using largely 454 data. "

Since you like Pevzner, if you have not done so, take a look at his Dec. 2008 article in Genome Research entitled "De Novo fragment assembly with short mate-paired reads: Does read length matter?" He talks about a theoretical read length barrier. For yeast this seems to be 60 bases; i.e., increasing mate-paired reads past 60 bp does not help the assembly. For more complex organisms the barrier is not defined, but it could very well be below the length of 454 reads.

Steve said...

If you read the next generation sequencing papers carefully you realize that these technologies are not yet ready to replace Sanger for new genomes. For example, Wang et al. 2009 (PMID 18987735, "The diploid genome sequence of an Asian individual") report a five-fold excess of deletions over insertions relative to the reference genome. If insertions and deletions should really be about equally common, only one in five insertions was found, which means that 80% of novel sequence was probably missed. That is for a genome (human) with a very high-quality reference genome and relatively little variation between individuals. It would be much worse for a novel genome.
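The arithmetic behind that 80% estimate can be sketched in a few lines. This is only an illustration under stated assumptions, not a calculation from the paper: it assumes deletions are detected completely and that true insertions and deletions are equally frequent. The function name is hypothetical.

```python
def fraction_insertions_missed(deletion_to_insertion_ratio: float) -> float:
    """Estimate the share of true insertions that went undetected.

    Assumptions (not from the paper): deletions are detected completely,
    and true insertions and deletions occur equally often. The observed
    insertions are then 1/ratio of the true number.
    """
    if deletion_to_insertion_ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / deletion_to_insertion_ratio


# A five-fold excess of deletions over insertions
print(f"{fraction_insertions_missed(5.0):.0%}")
```

If either assumption fails, for instance if deletions are also under-called or the two event types are not truly balanced, the estimate shifts accordingly.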

Of course, these problems will be solved by proper attention to assembly and further improvement in the methods themselves. My concern is that the period between now and when the technology improves further will be characterized by an outpouring of bad genomes.

Kudos to the ICSASG for doing it right.

Keith Robison said...

Thank you both for your instructive comments. I've been meaning to read the Pevzner paper, but somehow have forgotten to download it on three trips to the MIT library (or perhaps I should double-check my USB key for it!)

With regard to the small indels vs. the human reference, I would agree that is a topic worth further investigation. However, despite all the effort put into the human reference I'd rather not just discount them wholesale but see them semi-systematically investigated. The recent publication by the SOLiD group validated many SNPs and some large structural variants by other means, but unfortunately did not do so for small indels.

It appears the Korean sequence (GAII) has a lot of indels vs other sequences as well, but they did no validation work.

I wouldn't characterize these as necessarily "bad" genomes; they may just have some systematic issues. Still, there is an awful lot of useful stuff you can do with a lower-quality genome that you can't do with a completely unsequenced genome. Depending on what the short-term experimental outlook is, I'd probably generally lean towards having a lot of quick-and-dirty genomes rather than one really polished one, given that getting a really polished one might become much cheaper in the very near future. But that's definitely a judgement call.