Monday, August 17, 2009

A Standard Set of Test Genomes

I was away on vacation, enjoying an Internet-free existence on a tropical isle, when the brouhaha over Stephen Quake's genome sequencing paper blew up. It does appear that the paper used some very sloppy accounting & comparisons for its cost analysis, but some of the reaction to this has been a bit over-the-top (in particular, the needless throwing around of inflammatory personal descriptions).

But it did get me thinking. The real value of this paper is demonstrating the strengths & weaknesses (and specialized clever algorithmics) of Helicos' sequencing instrument. Helicos has struggled to claw its way into the marketplace, but has gotten a few instruments placed. The machine has some interesting properties (more on this later this week) and I'd love to get to try one out, but alas there are no service providers offering access at this time.

But to really understand the strengths and weaknesses of any platform, you need one or more common benchmarks on which to compare. Sequencers in different stages of development may aspire to different benchmarks. For example, very early stage sequencers could target a common small DNA, intermediate stage sequencers could shoot higher, and mature ones could move on to really big problems like the human genome. So, I humbly propose the following sketch of a set of standard targets which should be used for such proof-of-concept papers, going from small to big.

  1. M13. For very early stage sequencers. If that's still too much, just sequence out from the universal priming site. At the other end of things, it would be cool to have a set of M13 (or other clones) with various pathological sequence examples in them to build a set (or cocktail) of test samples.

  2. Lambda phage: 50Kb

  3. E. coli MG1655. ~4.5Mb

  4. Micrococcus luteus. Another bacterium, but with a G+C content above 70%

  5. Saccharomyces cerevisiae S288C. A good stepping stone, plus well studied.

  6. Plasmodium. Not only getting bigger (30Mb-ish), but only about 30% G+C. Being A+T rich can cause trouble just as being at the other end of the spectrum can.

  7. Fugu rubripes. Again, a stepping stone (~0.5Gb) genome.

  8. Homo sapiens, NA18507. This is the HapMap sample which has been sequenced on both the SOLiD and Illumina platforms, enabling comparison of their performance. It's too bad this isn't the sample used in the Quake study, since it would allow completely direct comparison.

  9. Corn. Comparable in size to human (the B73 reference is roughly 2.3Gb), and a step towards the biologically interesting and economically important plant genomes of huge size (10Gb on up) and hideous complexity (repetitive elements, high ploidies). Someone with experience in the field should probably specify the particular strain, though given that it is August & I used to love to grow it in the backyard, I'd suggest Silver Queen.
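Since several of the targets above are on the list for their base composition rather than their size, here is a minimal Python sketch of how one might measure G+C content, both for a whole sequence and in sliding windows. The function names (`gc_fraction`, `windowed_gc`) and the toy sequence are made up for illustration; nothing here comes from any particular platform's software.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("no unambiguous bases in sequence")
    return gc / total


def windowed_gc(seq: str, window: int, step: int):
    """Yield (start, gc_fraction) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        try:
            yield start, gc_fraction(chunk)
        except ValueError:
            continue  # window held only N or other ambiguity codes


if __name__ == "__main__":
    toy = "ATATATATGCGCGCGCATAT"  # made-up sequence, not a real genome
    print(gc_fraction(toy))
    for start, frac in windowed_gc(toy, window=8, step=4):
        print(start, frac)
```

The windowed version is the more useful one for benchmarking, since a genome with a moderate overall G+C content can still hide local stretches that are extreme in either direction.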

There are, of course, some even more mongo genomes. I suspect most developers will top out with human or perhaps corn, but if you really want to go crazy there are lungfish & Fritillaria.

The above is just a suggestion. Perhaps it has too many stages -- most of the genome sequencers seem to go tiny-bacterium-human; at least that's my impression.

In a similar vein, there should be some standard target regions for targeted sequencing methods to enable comparison, and again NA18507 would be an obvious choice for the genome to target those in.

1 comment:

James said...

For proofing a sequencing technology with corn I'd imagine the best (if boring) option would be the B73 inbred line (what the published genome was done with). Many maize geneticists will start talks by pointing out the extremely high genetic diversity within maize (equivalent to or greater than that between humans and chimps, on both a SNP and indel basis), so while sequencing another inbred would be of great scientific benefit, I'm not sure how much the sequence could really be error checked against the reference genome.

That said, if someone is willing to sequence another maize genome, a sweet corn or popcorn genome would be excellent from a comparative genomics point of view or at least a tropical cultivar like CML333.