But it did get me thinking. The real value of this paper is in demonstrating the strengths & weaknesses (and the specialized, clever algorithmics) of Helicos' sequencing instrument. Helicos has struggled to claw its way into the marketplace, but has gotten a few instruments placed. The machine has some interesting properties (more on this later this week) and I'd love to try one out, but alas there are no service providers offering access at this time.
To really understand the strengths and weaknesses of any platform, though, you need one or more common benchmarks on which to compare. Sequencers in different stages of development may aspire to different benchmarks. For example, very early stage sequencers could target a common small DNA, intermediate stage sequencers could shoot higher, and mature ones could move on to really big problems like the human genome. So, I humbly propose the following sketch of a set of standard targets which should be used for such proof-of-concept papers, going from small to big.
- M13. For very early stage sequencers. If that's still too much, just sequence out from the universal priming site. At the other end of things, it would be cool to have a set of M13 (or other clones) with various pathological sequence examples in them to build a set (or cocktail) of test samples.
- Lambda phage. ~48.5Kb
- E. coli MG1655. ~4.6Mb
- Micrococcus luteus. Another bacterium, but with a very high G+C content (a bit over 70%)
- Saccharomyces cerevisiae S288C. A good stepping stone, plus well studied.
- Plasmodium. Not only getting bigger (30Mb-ish), but only about 30% G+C. Being A+T rich can cause trouble just as being at the other end of the spectrum can.
- Fugu rubripes. Again, a stepping stone genome (~0.4Gb).
- Homo sapiens, NA18507. This is the HapMap sample which has been sequenced on both the SOLiD and Illumina platforms, enabling comparison of their performance. It's too bad this isn't the sample used in the Quake study, since that would have allowed a completely direct comparison.
- Corn. In the same size class as human (roughly 2.3Gb), but representative of the biologically interesting and economically important plant genomes, many of them of huge size (10Gb on up) and hideous complexity (repetitive elements, high ploidies). Someone with experience in the field should probably specify the particular strain, though given that it is August & I used to love to grow it in the backyard, I'd suggest Silver Queen.
There are, of course, some even more mongo genomes. I suspect most developers will top out with human or perhaps corn, but if you really want to go crazy there are lungfish & Fritillaria.
The above is just a suggestion. Perhaps it has too many stages; most of the genome sequencers seem to go tiny-bacterium-human, or at least that's my impression.
In a similar vein, there should be some standard target regions for targeted sequencing methods to enable comparison, and again NA18507 would be an obvious choice for the genome to target those in.
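As an aside on the G+C extremes in the list above: if someone did assemble such a benchmark set, a first sanity check on each reference would be its base composition. Here is a minimal Python sketch of that check. The function names and the 35%/65% cutoffs are my own arbitrary choices for illustration, not any established standard.

```python
def gc_content(seq: str) -> float:
    """Return the G+C fraction of a DNA sequence.

    Only A, C, G and T are counted (case-insensitive); ambiguity codes
    such as N are ignored rather than treated as bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / (gc + at)


def composition_class(seq: str, low: float = 0.35, high: float = 0.65) -> str:
    """Label a sequence by base composition.

    The low/high cutoffs are hypothetical defaults chosen for this sketch.
    """
    frac = gc_content(seq)
    if frac < low:
        return "A+T rich"
    if frac > high:
        return "G+C rich"
    return "balanced"


if __name__ == "__main__":
    # Toy sequences, not real genome fragments.
    for name, seq in [("toy_at", "ATATATGC"), ("toy_gc", "GGCCGGCA")]:
        print(name, round(gc_content(seq), 3), composition_class(seq))
```

Run over a whole reference (or in windows along it), something like this would show where in a benchmark genome a platform is likely to be stressed at either end of the composition spectrum.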