I find myself worrying sometimes that I worry too much about the words I use, and worrying some of the rest of the time that I don't worry enough. What can seem like the right words at one time might seem wrong some other time. The term "killer app" is thrown around a lot in the tech space, but would you really want to hear it used about sequencing a genome if you were the patient whose DNA was under scrutiny?
One term that sees a lot of traction these days is "exome sequencing". I listened in on a free Science magazine webinar today on the topic, and the presentations were all worthwhile. The focus was on the Nimblegen capture technology (Roche/Nimblegen/454 sponsored the webinar), though other technologies were touched on.
By "exome sequencing" what is generally meant is capturing and sequencing the exons in the human genome in order to find variants of interest. Exons have the advantage of being much more interpretable than non-coding sequences; we have some degree of theory (though quite incomplete) which enables prioritizing these variants. The approach also has the advantage of being significantly cheaper at the moment than whole genome sequencing (one speaker estimated $20K per exome). So what's the problem?
My concern is that the term "exome sequencing" is taken a bit too literally. Now, it is true that these approaches catch a bit of surrounding DNA as a side effect of library construction, and that the targeting designs cover splice junctions, but what about some of the other important sequences? According to my poll of practitioners of this art, their targets are entirely exons (confession: N=1 for the poll).
I don't have a general theory for analyzing non-coding variants, but there are nonetheless quite a few well-annotated non-coding regions of functional significance. Promoters are an obvious case. Annotation of human promoters and enhancers and other transcriptional doodads is an ongoing process, but some have been well characterized. In particular, the promoters for many drug-metabolizing enzymes have been scrutinized because variants there may have significant effects on how much of the enzyme is synthesized, and therefore on drug metabolism.
Partly coloring my concern is the fact that exome sequencing kits are becoming standardized; at least two are on the market currently. Hence, the design shortcomings of today might influence a lot of studies. Clearly sequencing every last candidate promoter or enhancer would tend to defeat the advantages of exome sequencing, but I believe a reasonable shortlist of important elements could be rapidly identified.
My own professional interest area, cancer genomics, adds some additional twists. At least one major cancer genome effort (at the Broad) is using exome sequencing. On the one hand, it is true that there are relatively few recurrent, focused non-coding alterations documented in cancer. However, few is not none. For example, in lung cancer the c-Met oncogene has been documented to be activated by mutations within an intron; these mutations cause skipping of an exon encoding an inhibitory domain. Some of these alterations are about 50 nucleotides away from the nearest splice junction, a distance that is likely to result in low or no coverage using the Broad's in-solution capture technology (confession #2: I haven't verified this with data from that system).
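To make the flank question concrete, here is a minimal sketch in Python of the check I have in mind: given a list of capture targets and a position of interest, how far is that position from the nearest targeted base, and does it fall within whatever flank the capture is assumed to pull in? Everything here is illustrative. The coordinates are made up, the function names are my own, and the flank sizes are assumptions rather than measured properties of any capture system.

```python
from typing import List, Tuple

# Half-open interval [start, end) on a single chromosome (hypothetical coordinates).
Interval = Tuple[int, int]


def distance_to_nearest_target(pos: int, targets: List[Interval]) -> int:
    """Return 0 if pos lies inside a target, else the distance in
    nucleotides from pos to the nearest targeted base."""
    if not targets:
        raise ValueError("no targets supplied")
    best = None
    for start, end in targets:
        if start <= pos < end:
            return 0
        # Last targeted base of a half-open interval is end - 1.
        dist = start - pos if pos < start else pos - (end - 1)
        if best is None or dist < best:
            best = dist
    return best


def within_assumed_flank(pos: int, targets: List[Interval], flank: int) -> bool:
    """True if pos is inside a target or within `flank` nucleotides of one.
    `flank` is an assumption about how much surrounding DNA gets captured."""
    return distance_to_nearest_target(pos, targets) <= flank


# Two made-up exon targets and an intronic position 50 nt past the first exon.
exon_targets = [(1000, 1200), (2000, 2150)]
intronic_pos = 1249

for assumed_flank in (30, 100):
    covered = within_assumed_flank(intronic_pos, exon_targets, assumed_flank)
    print(f"flank={assumed_flank} nt: within reach = {covered}")
```

Whether a real design behaves like the 30 nt case or the 100 nt case is exactly the thing I haven't verified; the point is only that the answer hinges on the effective flank, which is worth asking a vendor about.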
The drug-metabolizing enzyme (DME) promoters I mentioned before are a bit greyer for cancer genomics. On the one hand, the primary interest is generally in what somatic mutations have occurred in the tumor. On the other hand, the norm in cancer genomics is tending towards applying the same approach to normal (cheek swab or lymphocyte) DNA from the patient, so why not get the DME promoters too? After all, these variants may have influenced the activity of therapeutic agents or even development of the disease. Just as some somatic mutations seem to cluster enigmatically with patient characteristics, perhaps some somatic mutations will correlate with germline variants which contributed to disease initiation.
Whatever my worries, they should be time-limited. Exome sequencing products will be under extreme pricing pressure from whole genome sequencing. The $20K cited (probably using 454 sequencing) is already potentially matched by one vendor (Complete Genomics). Now, in general the cost of capture will probably be a relatively small contributor compared to the cost of data generation, so exome sequencing will ride much of the same cost curve as the rest of the industry. But whole exome capture itself probably runs $1-3K, due to the multiple chips required and the labor investment (anyone have a better estimate?). If whole mammalian genome sequencing really can be pushed down into the $5K range, then mammalian exome sequencing will not offer a huge cost advantage, if any. I'd guess interest in mammalian exome sequencing will peak in a year or two, so maybe I should stop worrying and learn to love the hyb.
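The arithmetic behind that last guess is simple enough to write down. This is a back-of-the-envelope sketch in Python, using only the rough figures above ($1-3K for capture, a hoped-for $5K whole genome); the function name is mine, and the per-exome sequencing cost is deliberately left as the unknown being solved for.

```python
def break_even_sequencing_cost(genome_cost: float, capture_cost: float) -> float:
    """Largest amount the sequencing portion of an exome can cost before
    capture + sequencing is no cheaper than a whole genome."""
    return genome_cost - capture_cost


genome_cost = 5000  # the hoped-for whole genome price, in dollars

# Low and high ends of my rough capture estimate.
for capture_cost in (1000, 3000):
    ceiling = break_even_sequencing_cost(genome_cost, capture_cost)
    print(f"capture ${capture_cost}: exome sequencing must cost under ${ceiling}")
```

In other words, at a $5K genome the sequencing portion of an exome has to come in under roughly $2-4K just to break even, before counting anything the exome misses.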