Sunday, May 31, 2009

Teasing small insertion/deletion events from next-gen data

My interest in next-generation sequencing is well on the way from shifting from hobby to work-central, which is exciting. So I'm now really paying attention to the literature on the subject.

One of the interesting uses for next-generation sequencing is identifying insertion or deletion alleles (indels) in genomes, particularly the human genome. Of course, the best way to do this is to do a lot of sequencing, compare the sequence reads against a reference genome, and identify specific insertions or deletions in the reads. However, this is generally going to require a full genome run & a certain amount of luck, especially in a diploid organism as you might not sample both alleles enough to see a heterozygous indel. A cancer genome might be even worse: these often have many more than two copies of the DNA at a given position and potentially there could be more than two different versions. In any case, full genome runs are in the ballpark of $50K, so if you really want to look at a lot of genomes a more efficient strategy is needed.

The most common approach is to sequence both ends of a DNA molecule and then compare the predicted distance between those ends with the distance on the reference genome. If you know the distribution of lengths that the sequence library has, then you can spot cases where the length on the reference is very different. In effect, you've lengthened (but made less precise) your ruler for measuring indels, and so you need many fewer measurements to find them.

One aside: in a recent Cancer Genomics webinar I watched a distinction was made between "mate pairs" and "paired ends" -- except now I forget which they assigned to which label (and am too lazy/time strapped to watch the webinar right now). In short, one is the case of sequencing both ends of a standardly prepared next-generation library, and the other involves snipping the middle out of a very large fragment to create the next-gen sequencing target. Here I was prepared to go pedantic and I'm caught napping!

Of course, that is if you know the distribution of DNA insert sizes. While you might have an estimate from the way the library is prepared, an obvious extension would be to infer the library's distribution from the actual data. An even more clever approach would be to use this distribution to pick out candidates in which the paired end sequences lie well within the distribution, but are consistently shifted relative to that distribution.

A paper fresh out of Nature Methods (subscription required & no abstract) incorporates precisely these ideas into a program called MoDIL. The program also explicitly models heterozygosity, allowing it to find heterozygous indels.

In performance analysis on actual human shotgun sequence, the MoDIL paper claims 95+% sensitivity for detecting indels of >=20bp. I tfor library used, this is detecting 10% length difference (insert size mean: 208; stdev: 13). The supplementary materials also look at the ability to detect heterozygous deletions of various sizes as a function of genome coverage (the actual sequencing data used had 120X clone coverage, meaning the average nucleotide in the genome would be found in 120 DNA fragments in the sequencing run). Dropping the coverage by a factor of 3 would be expect to still pick up most indels of >=40.

Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions Nature Methods DOI: 10.1038/nmeth.f.256


Martin said...

How does it handle very long indels? Is it possible that some things could get missed if they exceed the length of the cut DNA?

-Martin Gollery

Keith Robison said...

I think as long as your paired-end reads can be mapped accurately (i.e. the breakpoint isn't in the middle of your end read!), long indels shouldn't give any trouble -- indeed, those are the ones which previous approaches could catch because they stick out so much.