Tuesday, September 08, 2009

Next-generation Physical Maps III: HAPPy Maps

A second paper which triggered my current physical map madness is a piece (open access!) arguing for the adaptation of HAPPy mapping to next-gen sequencing. This is intriguing in part because I see (and have) a need for cheap & facile access to the underlying technologies, but also because I think there are some interesting computational problems (not touched on in the paper, as I will elaborate below) and some additional uses for the general approach.

HAPPy mapping is a method developed by Paul Dear for physical mapping. The basic notion is that large DNA fragments (the size range determining several important parameters of the map; the maximum range between two markers is about 10 times the minimum resolution) are randomly distributed into pools which each contain a bit less than one genome equivalent (about 0.7X), so that any given locus is present in roughly half the pools. Practical issues limit the maximum fragment size to about 1Mb; any larger and the fragments can't be handled in vitro. By typing markers across these pools, a map can be generated. If two markers are on different chromosomes or are farther apart than the DNA fragment size, then there will be no correlation between them. On the other hand, two markers which are very close together on a chromosome will tend to show up together in a pool. Traditionally, HAPPy pools have been typed by PCR assays designed to known sequences. One beauty of HAPPy mapping is that it is nearly universal; if you can extract high molecular weight DNA from an organism then a HAPPy map should be possible.
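To make the co-occurrence idea concrete, here is a minimal sketch of scoring marker pairs across pools. It is only an illustration, not the scoring used in the HAPPy literature: the function names, the `min_ratio` cutoff and the toy typing data are all my own assumptions, and a real analysis would use a proper statistic (a LOD score, say) rather than a raw observed/expected ratio.

```python
from itertools import combinations

def cosegregation(a, b):
    """Observed count of pools containing both markers, and the count
    expected if the two markers were distributed independently."""
    n_pools = len(a)
    both = sum(1 for x, y in zip(a, b) if x and y)
    expected = sum(a) * sum(b) / n_pools
    return both, expected

def linked_pairs(typing, min_ratio=1.5):
    """typing maps marker name -> list of 0/1 calls, one per pool.
    Returns pairs that co-occur at least min_ratio times more often
    than independence predicts (min_ratio is an arbitrary cutoff)."""
    pairs = []
    for m1, m2 in combinations(sorted(typing), 2):
        both, expected = cosegregation(typing[m1], typing[m2])
        if expected > 0 and both / expected >= min_ratio:
            pairs.append((m1, m2, both, expected))
    return pairs

# Hypothetical typing results for three markers across eight pools.
toy_typing = {
    "A": [1, 1, 0, 0, 1, 0, 1, 0],
    "B": [1, 1, 0, 0, 1, 0, 0, 0],
    "C": [0, 1, 1, 0, 0, 1, 0, 1],
}
print(linked_pairs(toy_typing))
```

In this toy data only A and B travel together, which is the pattern one would expect from two markers sitting closer together than the fragment size.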

The next-gen version of this proposed by the authors would make HAPPy pools as before but then type them by sequence sampling the pools. Given that a HAPPy pool contains many orders of magnitude less DNA than current next-gen library protocols require, they propose using whole-genome amplification to boost the DNA. Then each pool would be converted to a bar-coded sequencing library. The final typing would be performed by incorporating these reads into a shotgun assembly and then scoring each contig as present or absent in a pool. Elegant!

When would this mapping occur? One suggestion is to first generate a rough assembly using standard shotgun sequencing, as this improves the estimate of the genome size which in turn enables the HAPPy pools to be optimally constructed so that any given fragment will be in 50% of the pools. Alternatively, if a good estimate of the genome size is known the HAPPy pools could potentially be the source of all of the shotgun data (this is hinted at).
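Why the genome size estimate matters can be seen with a little arithmetic. The sketch below assumes fragments land in pools independently, so the number of copies of any locus in a pool is Poisson-distributed; that is my simplifying assumption, and the function names are hypothetical.

```python
import math

def fraction_of_pools_containing(genome_equivalents):
    """Chance that a given locus is present at least once in a pool
    holding this many genome equivalents (Poisson assumption)."""
    return 1.0 - math.exp(-genome_equivalents)

def genome_equivalents_for(fraction):
    """Genome equivalents per pool needed for a locus to appear in the
    requested fraction of pools."""
    return -math.log(1.0 - fraction)

def pool_mass_pg(haploid_genome_pg, fraction=0.5):
    """DNA mass per pool, given the haploid genome mass in picograms."""
    return haploid_genome_pg * genome_equivalents_for(fraction)

print(genome_equivalents_for(0.5))   # genome equivalents for 50% presence
print(pool_mass_pg(3.0))             # for a hypothetical 3 pg genome
```

Under this model the 50% target corresponds to ln 2 (roughly 0.7) genome equivalents per pool, so an error in the genome size estimate translates directly into pools that are too full or too empty.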

One possible variation on this approach would be to replace bar-coded libraries and WGA with Helicos sequencing, which can theoretically work on very small amounts of DNA. Fragmenting such tiny amounts would be one challenge to be overcome, and of course Helicos generates much shorter, lower-quality reads than the other platforms. But, since these reads are primarily for building a physical map (or, in sequence terms, driving to larger supercontigs), that may not be fatal.

If going with one of the other next-gen platforms (as noted in the previous post in this series, perhaps microarrays make sense as a readout), there is the question of input DNA. For example, mammalian genomes range in size from 1.73pg to 8.40pg. A lot of next-gen library protocols seem to call for more like 1-10ug of DNA, or about 6 logs more. The HAPPy paper's authors suggest whole-genome amplification, which is reasonable but could potentially introduce bias. In particular, it could be problematic to allow reads from amplified DNA to be the primary or even a major source of reads for the assembly. As I've noted before, other approaches such as molecular inversion probes might be useful for low amounts, but have not been demonstrated to my knowledge with picograms of input DNA. However, today I stumbled on two papers, one from the Max Planck Institute and one from Stanford, which use digital PCR to quantitate next-gen libraries and assert that this can assist in successfully preparing libraries from tiny amounts of DNA. It may also be possible to deal with this issue by attempting more than the required number of libraries, determining which built successfully by digital PCR and then pooling a sufficient number of successful libraries.
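The size of the input gap is easy to check. This is just the arithmetic from the paragraph above, using the genome masses and protocol inputs quoted there; the function name is mine.

```python
import math

PG_PER_UG = 1_000_000  # picograms in one microgram

def fold_shortfall(available_pg, required_ug):
    """How many times more DNA the protocol wants than is available."""
    return required_ug * PG_PER_UG / available_pg

# Genome masses (pg) and library inputs (ug) taken from the text above.
for genome_pg in (1.73, 8.40):
    for required_ug in (1, 10):
        fold = fold_shortfall(genome_pg, required_ug)
        print(f"{genome_pg} pg vs {required_ug} ug: "
              f"{fold:.3g}-fold short ({math.log10(fold):.1f} logs)")
```

Depending on which ends of the two ranges are paired, the gap runs from roughly 5 to nearly 7 logs, and a sub-genome HAPPy pool would sit slightly further down still.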

The desirable number of HAPPy libraries and the desired sequencing depth for each library are two topics not covered well in the paper, which is unfortunate. The number of libraries presumably affects both resolution and confidence in the map. Pretty much the entire coverage of this is the tail of one paragraph:
In the Illumina/Solexa system, DNA can be randomly sheared and amplified with primers that contain a 3 bp barcode. Using current instruments, reagents, and protocols, one Solexa "lane" generates ~120 Mb in ~3 million reads of ~40 bp. When each Solexa lane is multiplexed with 12 barcodes, for example, it will provide on average, ~10 Mb of sequence in ~250,000 reads for each sample. At this level of multiplexing, one Solexa instrument "run" (7 lanes plus control) would allow tag sequencing of 84 HAPPY samples. This means, one can finish 192 HAPPY samples in a maximum of three runs. New-generation sequencing combined with the barcode technique will produce innumerous amounts of sequences for assembly.
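The numbers in that passage can be checked with a few lines. This sketch only restates the quoted arithmetic (reads per lane, read length, barcodes per lane, usable lanes per run); the function names are my own.

```python
import math

def per_sample_yield(reads_per_lane, read_length_bp, barcodes_per_lane):
    """Reads and bases per barcoded sample, assuming reads split evenly."""
    reads = reads_per_lane // barcodes_per_lane
    return reads, reads * read_length_bp

def runs_needed(n_samples, lanes_per_run, barcodes_per_lane):
    """Whole instrument runs required to sequence n_samples pools."""
    samples_per_run = lanes_per_run * barcodes_per_lane
    return math.ceil(n_samples / samples_per_run)

reads, bases = per_sample_yield(3_000_000, 40, 12)
print(reads, bases)             # reads and bp per HAPPy sample
print(runs_needed(192, 7, 12))  # runs for 192 samples at 84 per run
```

The even split across barcodes is an idealization; real multiplexed lanes are rarely that balanced, which matters once detection depends on per-pool read counts.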

Changing the marker system from direct testing of sequence-tagged sites by PCR to sequencing-based sampling has an important implication, as discussed in the last post. If your PCR is working well, then a pool which contains a target will come up positive. But with sequencing, there is a very real chance of not detecting a marker which is present in a pool. This probability will depend on the size of the target -- very large contigs will have very little chance of being missed, but as contigs get smaller their probability of being missed goes up. Furthermore, actual size won't be as important as the effective size: the amount of sequence which can be reliably aligned. In other words, two contigs might be the same length, but if one has a higher repeat content, less of it can be aligned uniquely and that contig will be harder to detect. These parameters in turn can be estimated from the actual data.
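A rough feel for the false-negative rate comes from a simple Poisson model. The sketch below assumes reads fall uniformly across the DNA in a pool, that a present contig is there as a single copy, and that whole-genome amplification adds no bias; none of this comes from the paper, and the function names and the 3 Gb genome are illustrative.

```python
import math

def expected_hits(reads_per_pool, effective_contig_bp, genome_bp,
                  genome_equivalents=0.7):
    """Expected reads landing on a contig that is present in the pool."""
    pool_bp = genome_equivalents * genome_bp
    return reads_per_pool * effective_contig_bp / pool_bp

def miss_probability(reads_per_pool, effective_contig_bp, genome_bp,
                     genome_equivalents=0.7):
    """Chance that a present contig receives no reads at all."""
    return math.exp(-expected_hits(reads_per_pool, effective_contig_bp,
                                   genome_bp, genome_equivalents))

# 250,000 reads per pool (the figure quoted from the paper) against a
# hypothetical 3 Gb genome, for a range of effective contig sizes.
for size_bp in (1_000, 10_000, 100_000):
    p = miss_probability(250_000, size_bp, 3_000_000_000)
    print(f"{size_bp:>7} bp effective: P(missed) = {p:.3g}")
```

Even this crude model shows the steep dependence on effective size: at that depth a 10 kb contig is missed a substantial fraction of the time, while a 100 kb contig almost never is.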

The actual size of the pool is a critical parameter as well. So, the sampling depth (for a given haploid genome size) will determine how likely a contig of a given effective size is to be missed.

In any case, the problem of false negatives must be addressed. One approach is to only map contigs which are unlikely to have ever been missed. However, that means losing the ability to map smaller contigs. Presumably there are clever computational approaches to either impute missing data or simply deal with it.
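The first approach, mapping only contigs that are unlikely to be missed, amounts to picking a size cutoff. Under the same uniform-sampling Poisson assumption (mine, not the paper's), the cutoff follows directly from the tolerated miss rate; the function name and the example numbers are hypothetical.

```python
import math

def min_effective_length(reads_per_pool, genome_bp, max_miss=0.01,
                         genome_equivalents=0.7):
    """Smallest effective contig length (bp) whose chance of drawing
    zero reads in a pool where it is present is at most max_miss."""
    pool_bp = genome_equivalents * genome_bp
    return math.ceil(-math.log(max_miss) * pool_bp / reads_per_pool)

# Hypothetical 3 Gb genome at 250,000 reads per pool.
for max_miss in (0.1, 0.01, 0.001):
    cutoff = min_effective_length(250_000, 3_000_000_000, max_miss)
    print(f"miss rate <= {max_miss}: effective length >= {cutoff} bp")
```

Because the cutoff scales inversely with reads per pool, doubling the depth halves the size of the smallest mappable contig, which is one way to think about the depth-versus-cost trade.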

It should also be noted that HAPPy maps, like many physical mapping techniques, are likely to yield long-range haplotype information. Hence, even after sequencing one individual the approach will retain utility. Indeed, this seems to be the tack that Complete Genomics is taking to obtain this information for human genomes, though they call it Long Fragment Reads. It is worth noting that the haplotyping application has one clear difference from straight HAPPy mapping. In HAPPy mapping, the optimal pool size is one in which any given genome fragment is expected to appear in half the pools, which means pools of about 0.7X genome. But for haplotyping (and for trying to count copy numbers and similar structural issues), it is desirable to have the pools much smaller, as this information can only be obtained if a given region of the genome is haploid in that pool. Ideally, this would mean each fragment in its own pool (library), but realistically this will mean as small a pool size as one can make and still cover the whole genome in the targeted number of pools. Genomes of higher ploidies, such as many crops which are tetraploid, hexaploid or even octaploid, would probably require more pools with lower genomic fractions in order to resolve haplotypes.
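The case for smaller pools in haplotyping can be put in numbers. This sketch again assumes Poisson-distributed fragment copies per pool (my assumption) and asks: given that a region is present in a pool at all, how often is it there as exactly one copy? The function name is hypothetical.

```python
import math

def haploid_given_present(genome_equivalents):
    """P(exactly one copy of a region | at least one copy) in a pool
    holding this many genome equivalents, under a Poisson model."""
    c = genome_equivalents
    exactly_one = c * math.exp(-c)
    at_least_one = 1.0 - math.exp(-c)
    return exactly_one / at_least_one

for c in (0.7, 0.3, 0.1, 0.05):
    print(f"{c}X pool: {haploid_given_present(c):.3f}")
```

At the mapping-optimal pool size, a noticeable share of the pools containing a region hold more than one copy of it; shrinking the pools pushes that share toward zero, at the price of needing more pools to cover the genome.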

In conclusion, HAPPy mapping comes close to my personal ideal of a purely in vitro mapping system which looks like a clever means of preparing next-gen libraries. The minimum and maximum distances resolvable are about 10-fold apart, so more than one set of HAPPy libraries is likely to be desirable for an organism. Typically this is two sizes, since the maximum fragment size is around 1Mb (and may be smaller for organisms with difficult-to-extract DNA). A key problem to resolve is that HAPPy pools contain single-digit picograms of DNA. Amplification is a potential solution but may introduce bias; clever library preparation (or screening) may be another approach. An open problem is the best depth of coverage for the multiplexed HAPPy next-gen libraries. HAPPy can be used both for physical mapping and long-range haplotyping, though the fraction of genome in a pool will differ for these different applications.

Jiang Z, Rokhsar DS, & Harland RM (2009). Old can be new again: HAPPY whole genome sequencing, mapping and assembly. International Journal of Biological Sciences, 5 (4), 298-303. PMID: 19381348
