I've been completely slacking on completing my self-imposed series on how second generation sequencing (I'm finally trying to kick the "next gen" term) might reshape the physical mapping of genomes. It hasn't been that my brain has been ignoring the topic, but somehow I've not extracted the thoughts through my fingertips. And I've figured out part of the reason for my reticence -- my next installment was supposed to cover BACs and other clone-based maps, and I'm increasingly thinking these aren't going to be around much longer.
Amongst the many ideas I turned over was how to adapt BACs to the second generation world. BACs are very large segments -- often a few hundred kilobases -- cloned into low-copy (generally single-copy) vectors in E. coli.
One approach would be to simply sequence the BACs. One key challenge is that a single BAC is poorly matched to a second generation sequencer; even a single lane of a sequencer is gross overkill. So good high-throughput multiplex library methods are needed. Even so, there will be a pretty constant tax of resequencing the BAC vector and the inevitable contaminating host DNA in the prep. That's probably going to run about 10% wastage -- not unbearable but certainly not pretty.
Another type of approach is end-sequencing. For this you really need long reads, so 454 is probably the only second generation machine suitable. But you need to smash down the BAC clone to something suitable for emulsion PCR. I did see something in BioTechniques on a vectorette PCR method to accomplish this, so it may be a semi-solved problem.
A complementary approach is to landmark the BACs, that is to identify a set of distinctive features which can be used to determine which BACs overlap. At the Providence conference one of the posters discussed getting 454 reads from defined restriction sites within a BAC.
But any of these approaches still requires picking the individual BACs, prepping DNA from them and performing these reactions. While converting to 454 might reduce the bill for the sequence generation, all that picking & prepping is still going to be expensive.
BACs' baby cousins are fosmids, which are essentially the same vector concept but designed to be packaged into lambda phage. Fosmids carry approximately 40Kb of DNA. I've already seen ads from Roche/454 claiming that their 20Kb mate pair libraries obviate the need for fosmids. While 20Kb is only half the span, many issues that fosmids solve are short enough to be fixed by a 20Kb span, and the 454 approach enables getting lots of them.
This is all well and good, but perhaps it's time to look just a little bit further ahead. Third generation technologies are getting close to reality (those who have early access to Pacific Biosciences machines might claim they are reality). Some of the nanopore systems detailed in Rhode Island are clearly far away from being able to generate sequences you would believe. However, physical mapping is a much less demanding application than trying to generate a consensus sequence or identify variants. Plenty of times in my early career it was possible using BLAST to take amazingly awful EST sequences and successfully map them against known cDNAs.
Now, I don't have any inside information on any third generation systems. But, I'm pretty sure I saw a claim that Pacific Biosciences has gotten reads close to 20Kb. Now, this could have been a "magic read" where all the stars were aligned. But imagine for a moment if this technology can routinely hit such lengths (or even longer) -- albeit with quality that makes it unusable for true sequencing but sufficient for aligning to islands of sequence in a genome assembly. If such a technology could generate sufficient numbers of such reads in reasonable time, the 454 20Kb paired libraries could start looking like buggy whips.
Taking this logic even further, suppose one of the nanopore technologies could really scan very long DNAs, perhaps 100Kb or more. Perhaps the quality is terrible, but again, that's fine as long as it's just good enough. For example, suppose the error rate was 15%, or a phred score of about 8. AWFUL! But, in a sequence of 10,000 bases (standing for the size of a fair-sized sequence island in an assembly) you'd expect to find nearly 3 error-free stretches of 50 bases, since the chance that any given 50-base stretch is entirely correct is 0.85^50, or about 3 in 10,000, and there are nearly 10,000 places such a stretch could start. Clearly some clever algorithmics would be required (especially since with nanopores you don't know which direction the DNA is traversing the pore), but this would suggest that some pretty rotten sequencing could be used to order sequence islands along long reads.
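For anyone who wants to play with the numbers, here is a minimal sketch of that back-of-the-envelope estimate. It assumes errors hit each base independently at a fixed rate, which real nanopore reads almost certainly won't obey, and the function names are my own invention for illustration. Note that it counts overlapping 50-base windows, so the number of truly separate clean runs would be smaller.

```python
import math


def phred_score(error_rate: float) -> float:
    """Convert a per-base error rate into a phred quality score."""
    return -10.0 * math.log10(error_rate)


def expected_clean_windows(length: int, error_rate: float, window: int) -> float:
    """Expected number of window-sized stretches containing no errors.

    Assumes independent per-base errors (a simplification) and counts
    every starting position, so overlapping windows are counted separately.
    """
    if window > length:
        return 0.0
    p_clean = (1.0 - error_rate) ** window  # chance one window is error-free
    start_positions = length - window + 1   # number of places a window can start
    return start_positions * p_clean


if __name__ == "__main__":
    print(f"phred for 15% error: {phred_score(0.15):.1f}")
    print(f"expected clean 50-mers in 10,000 bases: "
          f"{expected_clean_windows(10_000, 0.15, 50):.2f}")
```

Dropping the window to 30 bases or the error rate to 10% changes the picture dramatically, which is the sort of trade-off any real mapping algorithm would need to tune.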
Yet another variant on this line of thinking would be to use nanopores to read defined sequence landmarks from very long fragments. Once you have an initial assembly, a set of unique sequences can be selected for synthesis on microarrays. While PCR is required to amplify those oligos, it also offers an opportunity to subdivide the huge pool. Furthermore, with sufficiently long oligos on the chip one could even have multiple universal primer targets per oligo, enabling a given landmark to be easily placed in multiple orthogonal pools. With an optical nanopore reading strategy, 4 or more color-coded pools could be hybridized simultaneously and read. Multiple colors might be used for more elaborate coding of sequence islands -- i.e. one island might be encoded with a series of flashing lights, much like some lighthouses. Again, clever algorithmics would be needed to design such probe strategies.
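As a toy illustration of the "lighthouse" coding idea, here is a sketch of one way such codes might be chosen. Everything in it is my own assumption rather than anything a real platform does: four hypothetical colors labelled R, G, B and Y, a fixed number of probes per island, and perfect color calls. Because the read direction is unknown, it keeps only one of each code/reverse pair and drops palindromic codes, so that a flash pattern identifies both the island and its orientation.

```python
from itertools import product

COLORS = "RGBY"  # four hypothetical probe colors


def direction_safe_codes(length: int, colors: str = COLORS) -> list[str]:
    """List color codes that remain unambiguous when read in either direction.

    Each code is stored in a canonical form (the alphabetically smaller of the
    code and its reverse). Palindromes are skipped because they cannot reveal
    which way the molecule went through the pore.
    """
    codes = set()
    for combo in product(colors, repeat=length):
        code = "".join(combo)
        reverse = code[::-1]
        if code != reverse:
            codes.add(min(code, reverse))
    return sorted(codes)


def identify(observed: str, codes: list[str]):
    """Return (canonical_code, read_forward) for an observed flash pattern,
    or None if the pattern is not one of the assigned codes."""
    canonical = min(observed, observed[::-1])
    if canonical not in codes:
        return None
    return canonical, observed == canonical


if __name__ == "__main__":
    codes = direction_safe_codes(3)
    print(len(codes), "usable codes of length 3")
    print(identify("YGR", codes))
```

Under these assumptions the number of usable codes is (4^k - 4^ceil(k/2)) / 2 for k probes per island, so the code space grows quickly with k. A real design would also have to tolerate missed or miscalled flashes, which this sketch ignores.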
How far away would such ideas be? Someone more knowledgeable about the particular technologies could guess better than I could. But, it would certainly be worth exploring, at least on paper, for anyone wanting to show that nanopores are close to prime time. While really low quality reads or just landmarking molecules might not seem exciting, it would offer a chance to get the technology into routine operation -- and from such routine operation comes continuous improvement. In other words, the way to push nanopores into routine sequencing might be by carefully picking something other than sequence -- but making sure that it is a path to sequencing and not a detour.