Tuesday, September 01, 2009

Physical Maps: Part I of a series

A recent review/opinion paper I saw has led me to attempt to write something substantial about a topic I've generally given short shrift to: physical mapping. Given that fact, I won't be surprised if some of what I say will lead to corrections, which are always welcome, as with the recent bits on genome assembly which I posted here.

A physical map is an ordering of markers along a stretch of DNA, often with some concept of distance between the markers. The most useful physical maps link these markers to unambiguous islands of sequence. Common examples of physical maps include restriction maps (where the landmarks are sites for one or more restriction enzymes) and cytogenetic maps (where the sequence landmarks are placed in relation to the pattern of bands visible in stained DNA).
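To make the idea of "markers plus distances" concrete, here is a minimal sketch of an in silico restriction map. It assumes the sequence is already known, which is the reverse of the real situation, where a restriction map is inferred experimentally from fragment sizes. The toy sequence and the function names `restriction_map` and `marker_distances` are made up for illustration; the only real-world fact used is that EcoRI recognizes GAATTC.

```python
# Toy illustration only: the sequence and helper names are hypothetical.

def restriction_map(sequence, site):
    """Return the 0-based start position of every occurrence of `site`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def marker_distances(positions):
    """Return the distances between consecutive markers on the map."""
    return [b - a for a, b in zip(positions, positions[1:])]


ECORI = "GAATTC"  # EcoRI recognition site
toy_sequence = "AAGAATTCCCGGGTTTGAATTCAAAAAAGAATTCTT"

sites = restriction_map(toy_sequence, ECORI)   # ordered landmarks
gaps = marker_distances(sites)                 # spacing between landmarks
print(sites, gaps)
```

The ordered list of sites is the map, and the list of gaps is the "concept of distance" between landmarks.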

When I was in the Church lab, there just wasn't much talk about physical maps. Our goal was to do it all by sequencing. In a sense, a complete genome sequence is the ultimate physical map, and we really didn't want to bother with the less useful versions. We did perhaps find curious some of the "map into the ground" genome sequencing strategies presented at conferences. If I remember correctly, one of these was going to first generate a map at the BAC level, then break each BAC into cosmids & map those within the BAC, then break those into lambda clones and map them, then break those up into M13-size clones (or use transposon insertions) and map those. The goal was to ultimately plan the absolute minimum number of sequencing reads to get the job done. George's group would rather just make reads too cheap to worry about such stuff, and of course in the end that viewpoint won.

The catch, as the review I saw by Stephen J. O'Brien and colleagues points out, is that we still aren't at the stage where we can just take some species of interest, extract the DNA, run it through the sequencers and get a finished genome. Good physical maps help assemble all the little islands of sequence into a larger whole & help find errors in the assembly, particularly when dealing with repetitive sequences. The review cites the example of platypus (discussed in a previous post and follow-up). It is a bit curious that it describes this beast as lacking a physical map, as a BAC map was generated as part of the sequencing effort. In any case, the platypus project yielded an assembly which O'Brien and colleagues deem too sketchy at long ranges to reliably make evolutionary inferences.
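As a rough sketch of how a map can help here, suppose each contig contains a few markers whose positions on the physical map are known. The contigs can then be ordered along the map, and a contig whose markers appear in an order inconsistent with the map becomes a candidate misassembly. Everything below (marker names, positions in kb, contig names, and both functions) is hypothetical toy data, not anything from the review or from a real assembler.

```python
# Toy illustration only: all names and positions are invented.

def order_contigs(map_positions, contig_markers):
    """Order contigs by the mean map position of the markers they contain."""
    placed = {}
    for contig, markers in contig_markers.items():
        hits = [map_positions[m] for m in markers if m in map_positions]
        if hits:
            placed[contig] = sum(hits) / len(hits)
    return sorted(placed, key=placed.get)


def suspicious_contigs(map_positions, contig_markers):
    """Flag contigs whose markers are in neither ascending nor descending map order."""
    flagged = []
    for contig, markers in contig_markers.items():
        hits = [map_positions[m] for m in markers if m in map_positions]
        if hits != sorted(hits) and hits != sorted(hits, reverse=True):
            flagged.append(contig)
    return flagged


# Marker positions on the physical map, in kb (assumed values).
map_positions = {"m1": 10, "m2": 55, "m3": 120, "m4": 300, "m5": 410, "m6": 520}

# Markers in the order they occur along each assembled contig.
contig_markers = {
    "contigA": ["m4", "m5"],
    "contigB": ["m2", "m1"],        # reverse orientation, still consistent
    "contigC": ["m3"],
    "contigD": ["m5", "m3", "m6"],  # order disagrees with the map
}

contig_order = order_contigs(map_positions, contig_markers)
flagged = suspicious_contigs(map_positions, contig_markers)
print(contig_order, flagged)
```

A contig in reverse orientation is not flagged, since a descending run of markers is still consistent with the map; only a scrambled order is treated as suspect.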

Now, how much you care about such matters depends on what scale of genetic organization you are interested in. My personal interest has often tended to be on the level of individual genes (in eukaryotes; a bunch of my thesis work looked at operon-level stuff in bacteria), but O'Brien has a long-standing interest in the evolution of genomes. If you want to understand things on that scale, you need a good map. Very detailed maps can also be used to either check for or prevent errors in sequence assembly.

There's also the question of really difficult genomes. Many valuable crop species are hexaploid or octaploid, which presumably will make shotgun assembly even more challenging than with diploid genomes. Other interesting genomes resulted from recent duplication events or genome fusions. For example, common tobacco is an evolutionarily recent fusion of two other tobacco species.

A key question is how (or whether) to yank physical mapping into the world of next-generation sequencing. The review estimates the cost of physically mapping a mammalian genome at around $100K. With the cost of getting a shotgun sequence of a similar genome quickly heading to the $1-10K region, it will be an expensive upgrade to have a physical map. It's also not where the interest (or money) seems to be going in terms of technology development. Now, one could try to wait for promised super-long-read technologies to really show up, but that could be a while, and even these may have reads in the tens of kilobases, whereas some segmental duplications may be much larger. Any proposed next-gen rejiggering of a physical mapping technique would need to come in under $100K. Alas, in many cases I don't know how to estimate some of the costs, but I will try to when I can.

Paired-end & mate-paired reads provide one approach to generating a map, and I think there are some protocols in which the pairs . That's a great start, but what folks like O'Brien are looking for is more on the scale of many tens or even hundreds of kilobases or even megabases.
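For what it's worth, the way read pairs yield map-like information can be sketched in a few lines: if the two mates of a pair align to different contigs, that is evidence the contigs lie near each other, and several independent pairs saying the same thing make the link more believable. The function `contig_links`, the `min_support` threshold and the toy alignments below are all assumptions for illustration; real scaffolders also use insert sizes and read orientation, which this ignores.

```python
from collections import Counter

# Toy illustration only: names, threshold and data are hypothetical.

def contig_links(pair_alignments, min_support=2):
    """Count read pairs whose mates hit different contigs.

    `pair_alignments` is a list of (contig_of_mate1, contig_of_mate2) tuples.
    Only contig pairs supported by at least `min_support` read pairs are kept.
    """
    links = Counter()
    for contig_a, contig_b in pair_alignments:
        if contig_a != contig_b:  # pairs within one contig add no linkage
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_support}


pairs = [
    ("c1", "c2"), ("c2", "c1"), ("c1", "c1"),
    ("c2", "c3"), ("c3", "c2"), ("c3", "c2"),
    ("c1", "c3"),  # a single pair, below the support threshold
]

links = contig_links(pairs)
print(links)
```

The reach of such links is limited by the separation between the mates, which is why this only goes part of the way toward the long-range maps discussed here.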

It's worth noting a likely change in the role of physical mapping. In the past, it was considered a prerequisite for sequencing and was much cheaper than the sequencing itself. Now, a detailed physical map may be viewed as a desirable add-on, an option for improving a less-successful assembly or (as we shall see) an integrated part of the genome assembly process.

In the next installment, I'll take a side look at some general issues with using next-gen sequencing for physical map construction and the alternative approach of microarrays. The current plan is to follow that with a look at one proposed approach called HAPPy mapping, which is elegantly attuned to the next-gen world, & also to offer some personal variants on the approach. The entry after that will look at Radiation Hybrid maps, and then I'll tackle clone (BAC & fosmid) maps and then cytogenetic maps. Along the way, I'll throw out some crazy ideas and also try to identify ways in which these strategies might leverage other trends in the field and/or provide more impetus to develop some useful capabilities. Well, that's the plan right now -- but since I'm drafting about one entry ahead of the one posted, that outline is subject to change.
