Thursday, September 03, 2009

Physical Maps II: Reading the signposts

How do we build a physical map? In the abstract, a physical map is built by first dividing the genome up into either individual pieces or pools of pieces. These pieces need to be somewhere between large and gigantic; the size of the pieces determines the resolution of the map and its ability to span confusing or repetitive genomic regions. Ideally the pieces have a tight size distribution, but that isn't always the case. A set of known sequences is then typed against all these pools or pieces, and that data is fed into the appropriate algorithm to build a map.

Each physical mapping technology has its own parameters: the number of pieces or pools, whether they are pieces or pools (though of course we can always pool pieces, but it can be challenging to piece pools!), what contaminating DNA is necessarily introduced by the mapping technology, and other aspects. For some pool-based technologies, 100-200 pools (or more likely 96, 192, 288 or 384) can generate a useful map. Technologies also differ in how universal they are; most are very nearly so, but some may be limited to certain neighborhoods of biological space.


If next-generation sequencing is to worm its way into this process, clearly it is at the point of reading these known sequences. However, it is also important to not just assume that next gen should worm its way in; it must be better than the alternative(s) to do that. Right now, the king of the hill is microarrays.

Several microarray platforms enable designing and building a custom array of quite high density (tens or hundreds of thousands of spots easily -- and perhaps even millions) and typing DNA pools on it for perhaps $100-500 per sample. Using a two-color scheme allows two samples to be typed per array, compressing costs. Alternatively, certain other formats would allow getting long-range haplotype information, but at one chip per sample. This also points out one possible reason to pick another method: it might deliver substantially more information, and that information might be worth some additional cost.

Remember, the estimate for constructing a physical map for a mammalian species is on the order of $100K, and this is for a method (radiation hybrid) which appears to need approximately 100 pools for a decent map. So, given the costs estimated above, typing the pools would run from perhaps as low as $5000 ($100/sample pair) to $50,000 -- quite a range. Any next-gen approach needs to come in at that price or lower for typing the pools -- unless it somehow delivers some really valuable information OR is the major route of acquiring sequence (in which case the physical map cost is folded into the sequencing cost). Folding the map into the sequencing is unlikely to be a wise strategy, but I'll cover that in another post dealing with a technology that is tempting in that direction.
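To make the arithmetic behind that range explicit, here is a minimal sketch. The function name and its parameters are my own illustration; the inputs are the rough figures quoted above (about 100 pools, $100 per two-color sample pair at the cheap end, $500 per single-sample array at the expensive end), not real quotes from any vendor.

```python
def typing_cost(n_pools, cost_per_array, samples_per_array):
    """Rough cost of typing every pool on arrays (illustrative only)."""
    # Ceiling division: a partly filled array still has to be bought.
    n_arrays = -(-n_pools // samples_per_array)
    return n_arrays * cost_per_array


# Cheap end: two-color scheme, two pools per $100 array.
low = typing_cost(n_pools=100, cost_per_array=100, samples_per_array=2)

# Expensive end: one pool per $500 array.
high = typing_cost(n_pools=100, cost_per_array=500, samples_per_array=1)

print(f"typing cost range: ${low:,} to ${high:,}")
```

Any sequencing-based alternative can be dropped into the same comparison by estimating its per-pool library and run cost.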

Arrays have a bunch of other advantages. If you have contaminating DNA -- and know its sequence -- you can use that to design array probes that won't hybridize to the contaminant. In some cases that may blind the design to some important genes, but often it won't be a problem. Probes can also be chosen very deliberately; there's no element of chance here. Probes can also be chosen almost regardless of the size of a sequence island, so even very tiny islands in a draft assembly can be placed on the physical map.

Shotgun Sequencing

In contrast, another approach would be to use shotgun libraries for the mapping. Each library would represent a different pool, and to keep cost from mushrooming, a high degree of multiplexing will be required. Having the ability to cheaply prepare and multiplex tens or hundreds of libraries together will be a generally useful technology. But a straight shotgun approach has all sorts of drawbacks versus arrays.

First, any contaminant will probably be sequenced also, reducing the yield. For some mapping strategies (e.g. BACs), this might be 5-10% "overburden". But for radiation hybrid maps, the number might be more like 60%. Furthermore, repetitive sequences that can't be uniquely mapped will further reduce the amount of useful data from a mapping run.
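A back-of-the-envelope sketch of that yield loss follows. The contaminant fractions are the rough guesses above (5-10% for BACs, perhaps 60% for radiation hybrids); the repeat fraction and the read count are placeholders I made up for illustration, and the two losses are assumed to be independent.

```python
def useful_reads(total_reads, contaminant_fraction, repeat_fraction):
    """Reads left after discarding contaminant and non-unique reads.

    Assumes the two losses are independent, which is a simplification.
    """
    return total_reads * (1 - contaminant_fraction) * (1 - repeat_fraction)


TOTAL_READS = 1_000_000   # hypothetical reads per pool
REPEAT_FRACTION = 0.30    # hypothetical share of reads that cannot be placed uniquely

for label, contaminant in [("BAC-like", 0.10), ("radiation hybrid", 0.60)]:
    kept = useful_reads(TOTAL_READS, contaminant, REPEAT_FRACTION)
    print(f"{label}: about {kept:,.0f} useful reads per pool")
```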

Second, shotgun sequencing means sampling the available sequence pool, which for mapping means the probability of seeing a contig that is present in a pool depends on the size of the contig. More precisely, it depends on the effective size of the contig: the amount of uniquely mappable sequence which can be derived from it. Since most mapping techniques use both the presence and absence of a landmark in a pool to derive information, this will mean that only contigs above a certain effective size can be confidently mapped; very small contigs will frequently be incorrectly scored as negative for a pool.
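One way to put rough numbers on this size dependence is a simple Poisson sampling model. This is my own simplifying assumption, not something any mapping pipeline is known to use: reads are treated as landing uniformly at random on the mappable DNA in a pool, and a contig is scored present once at least `min_reads` reads hit it. All the sizes and read counts below are hypothetical.

```python
import math


def detection_probability(effective_size_bp, pool_size_bp, useful_reads, min_reads=1):
    """Chance that a contig present in a pool gets at least min_reads reads.

    Assumes uniform random sampling, so the number of reads on the contig
    is Poisson with mean useful_reads * effective_size_bp / pool_size_bp.
    """
    lam = useful_reads * effective_size_bp / pool_size_bp
    # Probability of seeing fewer than min_reads reads.
    p_too_few = sum(
        math.exp(-lam) * lam ** k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_too_few


# Hypothetical pool: 1 Gb of mappable DNA sampled with 1,000,000 useful reads.
for size_bp in (500, 5_000, 50_000):
    p = detection_probability(size_bp, 1_000_000_000, 1_000_000, min_reads=2)
    print(f"{size_bp:>6} bp effective size: P(detected) = {p:.3f}")
```

Under these assumptions the false-negative rate for a contig that really is in the pool is just one minus this probability, which is why the smallest contigs are the ones that get mis-scored.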

Targeted Sequencing

The other option is targeted sequencing. Again, having the ability to cheaply generate multiplex tagged targeted subsets will be very valuable. Many targeted strategies have been proposed, but I think there are two which might be useful in this context.

The first is the sort of padlock probe (aka molecular inversion probe) design that has shown up in a number of papers. These work with exquisitely small quantities of DNA -- though (as will come up in a later post) improving their sensitivity by 1000X would be really useful. Cost of fabrication is still an issue; they can be built on microarrays but require downstream amplification. The design issues are apparently being worked out; early papers had very high failure rates (many probes were never seen in the results).

The other technology is microarray capture. However, this can't compete on cost unless a lot of libraries can go against the same array. In other words, there would be a need to pool the multiplexed libraries prior to selection -- another cool capability that would find many uses but has not yet been demonstrated.


Could any of these next-gen approaches compete with direct hybridization of labeled pools to microarrays? Given the wide (10X) swing in costs I estimated above, it's hard to be sure, but I suspect that for many mapping approaches microarray hybridization may have some longevity as the landmark reading method. Sequencing may have an edge where either the economics become very different or next-gen can extract additional information. This is a useful way to focus thoughts as we proceed through the different mapping approaches.
