To take the simplest case, imagine you are planning to sequence a human female's genome, one which you know is exactly like the reference human female genome in structure -- meaning they will have exactly two copies of each region of the genome and will have typical numbers of SNPs. Can you find all the SNPs?
In an ideal world, and in some technologists' dreams (more on that later this week), you would simply split open one cell, tease apart all the chromosomes, and read each one from end-to-end. 100% coverage, guaranteed.
Unfortunately, we are nowhere near that scenario. While chromosomes are millions of nucleotides long, our sequencing technologies read very short stretches. 454 is currently claiming reads ~200 nucleotides long (though the grapevine suggests that this is rarely achieved by customers), and the other next generation sequencing technologies are expected to have read lengths in the 20-40 nucleotide range or so. SNPs occur, on average, about once every kilobase or so, so there is a big discrepancy: most reads will contain at most one SNP and so cannot tell us which SNPs sit together on the same chromosome. The linkage of different SNPs to each other on the same chromosome is called a haplotype.
Haplotypes are important things to be able to identify & track. For example, if an individual has two different SNPs in the same gene, it can make a big difference if they are on the same chromosome or different chromosomes. Imagine, for example, that one SNP eliminates transcription of the gene while the other one generates a non-functional protein. If they are on the same chromosome (in cis), having two null mutations is no different than having just one. On the other hand, if they are on different chromosomes (in trans), no functional copy is present. Other pairs of SNPs might have the potential to reinforce or counteract each other in cis but not in trans.
The second challenge is that we don't in any way start at one end of a chromosome and read to the other. Instead, the genome is shattered randomly into a gazillion (technical term!) little pieces.
If you think about this, if we look through all the sequence data from a single human (or canine or any other diploid) genome, we can sort the sequences into two bins:
- Positions for which we definitely found two different versions
- Positions for which we always found the same nucleotide
Category #2 will contain mostly sequences in which the genome of interest was identical for both copies (homozygous), but it could also contain cases where we simply never saw both copies. For example, if we saw a given region in only one read, we know we couldn't possibly have seen both copies. Category #1 will have some severe limits: we can link sequences to SNPs only if they are within a read length of an informative SNP (one which is heterozygous in the individual), and in practice the reach will generally be much shorter (since the informative SNP will rarely do us the honor of being at the extreme beginning or end of a read).
This immediately suggests one trick: count the number of times we see a copy. Based on a Poisson distribution we can estimate whether it is likely that every read we saw was derived from the same copy.
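As a rough sketch of that counting trick, here are two back-of-the-envelope calculations. The assumptions are mine, not measured properties of any platform: each read over a site is taken to come from either copy with probability 1/2, independently, and the number of reads landing on each copy is taken to be Poisson with mean equal to half the overall depth. The function names are made up for illustration.

```python
import math


def prob_all_reads_same_copy(n_reads: int) -> float:
    """Chance that n reads over a site all came from just one of the two copies.

    Assumes each read independently comes from either copy with probability 1/2.
    """
    if n_reads < 1:
        raise ValueError("need at least one read")
    # Either all reads from copy A or all from copy B: 2 * (1/2)^n.
    return 2.0 * 0.5 ** n_reads


def prob_some_copy_unseen(mean_depth: float) -> float:
    """Chance that at least one of the two copies gets zero reads at a site.

    Assumes reads per copy are Poisson with mean mean_depth / 2.
    """
    p_copy_seen = 1.0 - math.exp(-mean_depth / 2.0)
    return 1.0 - p_copy_seen ** 2


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, "reads: P(all from one copy) =", prob_all_reads_same_copy(n))
    for depth in (1, 4, 10, 20):
        print("depth", depth, ": P(a copy unseen) =", prob_some_copy_unseen(depth))
```

Under these assumptions a site seen only once or twice tells you very little about the second copy, while a site seen ten times with the same nucleotide every time is fairly convincing. Real data will be messier, since sequencing errors and uneven sampling both break the simple model.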
Of course, Nature doesn't make life easy. Many regions of the genome are exact repeats of one form or another. A simple example: Huntington's disease is due to a repeating CAG triplet (codon); in extreme cases the total length of a repeat array can be well over a kilobase, again far beyond our read length. Furthermore, there are other trinucleotide repeats in the genome, and also other large identical or nearly identical repeats. For example, we all carry multiple identical copies of our ribosomal RNA genes and also have a bunch of nearly identical copies of the gene for the short protein ubiquitin.
There is one more trick which can be used to sift the data further. Many of the next generation technologies (as well as Sanger sequencing approaches) enable reading bits of sequence from two ends of the same DNA fragment. So, if one of the two reads contains an informative SNP but the other doesn't, then we know that second region is in the same haplotype. Therefore, we could actually see that region only twice but be certain that we have seen both copies. With fragments of the correct size, you might even get lucky and get a different informative SNP in each end -- building up a 2-SNP haplotype.
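To make the paired-end idea concrete, here is a toy sketch of tallying 2-SNP haplotypes from read pairs. The input format is entirely hypothetical: each mate is assumed to have already been aligned and boiled down to a dictionary of `{snp_position: allele}` for the sites it covers, and `two_snp_haplotypes` is a name invented for this example rather than part of any real tool.

```python
from collections import Counter


def two_snp_haplotypes(read_pairs, het_sites):
    """Count allele combinations observed on the same DNA fragment.

    read_pairs: iterable of (mate1, mate2), each a dict {snp_position: allele}
                (a hypothetical, pre-digested representation of a read pair).
    het_sites:  set of positions known to be heterozygous (informative).
    Returns a Counter keyed by (pos_a, allele_a, pos_b, allele_b).
    """
    counts = Counter()
    for mate1, mate2 in read_pairs:
        # Both mates come from one fragment, hence from one chromosome copy.
        observed = {**mate1, **mate2}
        sites = sorted(pos for pos in observed if pos in het_sites)
        for i in range(len(sites)):
            for j in range(i + 1, len(sites)):
                a, b = sites[i], sites[j]
                counts[(a, observed[a], b, observed[b])] += 1
    return counts


if __name__ == "__main__":
    pairs = [
        ({100: "A"}, {900: "G"}),  # links A at 100 with G at 900
        ({100: "C"}, {900: "T"}),  # the other chromosome copy
        ({100: "A"}, {}),          # second mate covers no informative SNP
    ]
    for haplotype, n in two_snp_haplotypes(pairs, {100, 900}).items():
        print(haplotype, n)
```

A real pipeline would also have to cope with sequencing errors and with mates that align ambiguously into repeats, both of which this sketch ignores.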
This is particularly relevant to what I suggested yesterday. Suppose, for example, that the relevant mutation is a Mendelian dominant. That means it will be heterozygous in the genome. In regions of the genome that are poorly sampled, we won't be sure if we can really rule them out -- perhaps the causative mutation is on the haplotype we never read in that spot.
Conversely, suppose the causative mutation was recessive. If we see a rare SNP in a region which we read only once, we can't know if it is heterozygous or homozygous.
Large rearrangements or structural polymorphisms have similar issues. We can attempt to identify deletions or duplications by looking for excesses or deficiencies in reading a region, but that will be knotted up with the original sampling distribution. The real smoking gun would be to find the breakpoints, the regions bordering the spot where the order changes. If you are unlucky and miss getting the breakpoint sequences, or can't identify them because they are in a repeat (which will be common, since repeats often seed breakpoints), things won't be easy.
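The read-counting approach to deletions and duplications can be sketched as a simple Poisson tail calculation. This assumes, purely for illustration, that read counts in a window follow a Poisson distribution around some expected value; real coverage is usually lumpier than that, so treat the numbers as optimistic. The function names are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), lam > 0, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_tails(observed: int, expected: float):
    """Return (P(X <= observed), P(X >= observed)) for X ~ Poisson(expected).

    A tiny lower tail hints at a deficiency (possible deletion);
    a tiny upper tail hints at an excess (possible duplication).
    """
    lower = sum(poisson_pmf(k, expected) for k in range(observed + 1))
    upper = 1.0 - lower + poisson_pmf(observed, expected)
    # Guard against floating-point drift just outside [0, 1].
    return min(1.0, lower), min(1.0, max(0.0, upper))


if __name__ == "__main__":
    # A window where 100 reads are expected; losing one of the two copies
    # (a heterozygous deletion) would roughly halve that.
    for seen in (50, 85, 100, 150):
        low, high = poisson_tails(seen, 100.0)
        print(seen, "reads: lower tail", low, "upper tail", high)
```

With deep sampling a halved count stands out clearly, but with shallow sampling the expected count in a window is small and a halving is hard to distinguish from ordinary sampling noise, which is the knot described above.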
Of course, you can try to make your own luck. This is a sampling problem, so just sample more. That is a fine strategy, but deeper sampling means more time & money, and with fixed sequencing capacity you must trade going really deep on one genome versus going shallow on many.
You could also try experimental workarounds. For example, running a SNP chip in parallel with the genome sequencing would enable you to ascertain SNPs that are present but missed by sequencing, and would also enable finding amplifications or deficiencies of regions of the genome (SNP chips cannot, though, directly ascertain haplotypes). Or, you can actually use various cellular tricks to tease apart the different chromosomes, and then subject the purified chromosomes to sequencing or SNP chips. This will let you read out haplotypes, but with a lot of additional work and expense.
I did, at the beginning, set the scenario with a female genome. This was, of course, very deliberate. For most males, most of the X and Y chromosomes are present in single copy (the exception is a region on each that pairs with the other, the so-called pseudoautosomal region, which hence has the same haplotyping problem). So the problem goes away -- for a small fraction of the genome.
We will soon have multiple single human genomes available for analysis: Craig Venter will apparently be publishing his genome soon, and James Watson recently received his. It will be interesting to see how that haplotyping issue is handled & plays out in the early complete genomes, and whether backup strategies such as SNP chips are employed.