One of the biggest challenges with short read sequencing technology is the short part; another is accuracy. From a given read, one can only ascertain information on a size scale equal to the read length. This can be improved a bit by paired-end approaches, which read the two ends of a fragment. That gives some orientation information and, if the size of the fragment is relatively predictable, some distance information.
Ultimately, there is a relatively small limit on the size of fragment which can function in these systems (generally well under a kilobase, though I think I've seen discussion on SEQAnswers of kilobase-plus fragments in Illumina), so to span longer distances, mate-pair approaches were developed. These generate two reads known to be separated by a relatively long distance, on the scale of kilobases. They solve the problem with a bunch of trickery that eliminates the intervening DNA but retains linked tags. Such schemes typically involve shearing, ligation (or in vitro recombination) into circles, another shearing step, capture of the junction fragments, and conversion into a library. A lot of work, and with a number of serious disadvantages. In particular, they tend to require tremendous amounts of upfront DNA, often tens of micrograms. The length of the original inserts is limited by what can be efficiently circularized. Most mate-pair approaches shoot for under 20Kb, though Lucigen has a clever kit aimed at producing 40Kb+ mate pairs -- but the catch is that you are actually making a cosmid library first, with an input DNA requirement pushing 100 micrograms.
Complete takes a very different approach, but one with a lot of precedent. Indeed, it is basically executing the concept of HAPPy sequencing, which I covered almost three years ago. First, partition high molecular weight DNA into aliquots such that each aliquot has half a genome's worth of DNA (or less). In such pools, two alleles that are physically linked will tend to show up together, whereas two alleles near each other in position but coming from different homologous chromosomes will rarely if ever be found together. By simply sequencing many such pools deeply, the frequency of co-occurrence of alleles can be measured and used to build a map -- analogous to linkage mapping. Furthermore, true variants will show up in a pool repeatedly, whereas sequencing errors will be scattered, so higher accuracy can be obtained. Simple!
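The pooling logic is simple enough to sketch in a few lines. Here's a toy version (all pool contents and allele names are invented for illustration, and real pools would hold vastly more sites): linked alleles co-occur across pools, opposite-haplotype alleles don't, and errors fail the recurrence test.

```python
# Toy illustration of the HAPPy/LFR pooling idea (all data invented).
# Each pool holds allele observations from whatever DNA fragments
# landed in that well. Alleles on the same physical fragment (same
# haplotype) co-occur in pools; alleles on opposite haplotypes rarely do.
pools = [
    {"snp1_A", "snp2_G"},           # fragment from haplotype 1
    {"snp1_A", "snp2_G", "err_X"},  # plus a scattered sequencing error
    {"snp1_T", "snp2_C"},           # fragment from haplotype 2
    {"snp1_A", "snp2_G"},
    {"snp1_T", "snp2_C"},
]

def cooccurrence(a, b):
    """Count pools in which alleles a and b were seen together."""
    return sum(1 for p in pools if a in p and b in p)

# Linked alleles show up together repeatedly...
assert cooccurrence("snp1_A", "snp2_G") == 3
# ...while alleles from opposite haplotypes essentially never do.
assert cooccurrence("snp1_A", "snp2_C") == 0

# True variants recur across independent pools; errors appear in one.
counts = {}
for p in pools:
    for allele in p:
        counts[allele] = counts.get(allele, 0) + 1
real = {a for a, n in counts.items() if n >= 2}
print(sorted(real))  # err_X is filtered out
```

With hundreds of pools instead of five, the same co-occurrence counts become a statistical linkage signal rather than an all-or-nothing one.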
Of course, in practice, not so simple. A lot of method development went into reducing this to practice, and almost as importantly, to economical practice. Furthermore, they've optimized the process to work with tiny numbers of cells, sequencing a human genome from as few as 10 cells.
The workflow is straightforward in concept. High molecular weight DNA is purified from cells and diluted, then aliquoted into 384 wells. The minuscule amount of DNA in each well is amplified by Multiple Displacement Amplification (MDA). A modification of typical MDA enables random fragmentation in the same tube, without purification -- MDA is performed with a small amount of deoxyuridine (dUTP) in the mix. Single-strand nicks are introduced enzymatically at the uracils, followed by nick translation to generate double-strand breaks (each nick is moved, or translated, by the action of polymerase until a nick on the other strand is encountered). Barcoded adapter ligation occurs in the same vessel (the steps do not occur simultaneously; I've left out a lot of reagent additions and heat inactivations). Amazingly, the whole process is claimed to add only $100 to library construction; at list prices the 384 MDA reactions would run about $2K alone.
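The uracil trick amounts to programmed random fragmentation: wherever a dU was incorporated, the strand will eventually be broken. A toy in-silico analogue (substitution frequency and sequence invented; the real chemistry involves nicking and nick translation, not a simple split):

```python
import random

# Toy analogue of dU-directed fragmentation: during "copying",
# occasionally substitute a U for a T, then break the strand at every
# uracil. The uracil frequency tunes the fragment size distribution.
random.seed(0)                    # reproducible toy run
template = "ACGT" * 50            # a 200 bp toy molecule
copied = "".join(
    "U" if base == "T" and random.random() < 0.1 else base
    for base in template
)
fragments = [f for f in copied.split("U") if f]

# The fragments, rejoined, recover everything except the broken sites.
assert "".join(fragments) == copied.replace("U", "")
print(len(fragments), "fragments, sizes:", [len(f) for f in fragments])
```

Lowering the 0.1 substitution rate lengthens the average fragment, which is essentially the knob the wet-lab protocol turns with the dUTP:dTTP ratio.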
Once these libraries are prepared, they can be pooled and sequenced on Complete's short read platform -- 35bp is short! But, these are mate pairs, giving more information. Most importantly, each pool can now be read and scored for variants. These are fed into an algorithm, summarized in their Figure 2 below.
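Since every read carries its well's barcode, assigning reads back to pools after sequencing is a simple demultiplexing step. A minimal sketch (barcodes, read layout, and sequences all invented; real barcode handling tolerates mismatches):

```python
# Toy demultiplexer for pooled, well-barcoded reads. Assumes the
# barcode is the first 4 bases of each read (an invented layout).
barcode_to_well = {"ACGT": 1, "TGCA": 2}
reads = ["ACGTAAAAAAA", "TGCACCCCCCC", "ACGTGGGGGGG", "NNNNTTTTTTT"]

wells = {}
unassigned = []
for r in reads:
    tag, insert = r[:4], r[4:]
    well = barcode_to_well.get(tag)
    if well is None:
        unassigned.append(r)      # unknown barcode: set aside
    else:
        wells.setdefault(well, []).append(insert)

print(wells)       # inserts grouped back into their wells of origin
print(unassigned)  # reads whose barcode matched no well
```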
A rough summary (based on my rough understanding) is that heterozygous SNPs are identified (a) and their connectivity between pools computed (b). This suggests a graph (c), which is optimized (d) and then resolved into contigs of alleles (e). If parental allele information is available, then labeling of the contigs by parental origin is possible (f).
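A much-simplified sketch of steps (a)-(e), not Complete's actual algorithm: heterozygous sites become graph nodes, pool evidence becomes signed edges (positive if two reference alleles were seen together, negative if they appear to be in trans), and a traversal propagates phase through connected components, yielding one phased contig per component. The site names and edge signs below are invented, and the paper's crucial optimization step for conflicting evidence is omitted entirely.

```python
from collections import deque

# Signed edges between heterozygous sites (invented evidence):
# +1 = alleles observed together in pools (cis), -1 = trans.
edges = {
    ("snp1", "snp2"): +1,
    ("snp2", "snp3"): -1,
    ("snp4", "snp5"): +1,   # no edge to snp1-3: a separate contig
}

adj = {}
for (u, v), sign in edges.items():
    adj.setdefault(u, []).append((v, sign))
    adj.setdefault(v, []).append((u, sign))

phase, contigs = {}, []
for start in adj:
    if start in phase:
        continue
    phase[start] = 0            # arbitrary phase for each new contig
    contig, q = [start], deque([start])
    while q:                    # BFS, propagating relative phase
        u = q.popleft()
        for v, sign in adj[u]:
            want = phase[u] if sign > 0 else 1 - phase[u]
            if v not in phase:
                phase[v] = want
                contig.append(v)
                q.append(v)
    contigs.append(sorted(contig))

print(contigs)  # two phased contigs
print(phase)    # 0/1 haplotype assignment within each contig
```

Real data has contradictory edges, which is why the paper needs an optimization step rather than this naive two-coloring.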
On the accuracy front, they claim to improve their previous SNP-calling accuracy by 10-fold, to about 600 false SNVs per human genome. For structural variants, the long haplotypes give much greater detection power, though the MDA process can generate false variants through chimaera formation.
For applications, one demonstrated in the paper is an improved ability to detect compound heterozygotes. In particular, they looked for cases in which both alleles of the same gene carry splice site mutations, but at different locations within the gene. This is an obvious use case for haplotypes, as distinguishing two bad copies from one good copy and one badly battered one is important.
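Once variants carry haplotype labels, the compound-het test is trivial: a gene is only compound heterozygous if damaging variants hit both haplotypes. A toy check (gene names, positions, and phase assignments all invented):

```python
# Toy compound-heterozygote screen over phased variants (data invented).
# "hap" is the haplotype (0 or 1) the phasing assigned each variant to.
variants = [
    {"gene": "GENE_A", "pos": 100, "hap": 0, "damaging": True},
    {"gene": "GENE_A", "pos": 950, "hap": 1, "damaging": True},  # trans
    {"gene": "GENE_B", "pos": 200, "hap": 0, "damaging": True},
    {"gene": "GENE_B", "pos": 450, "hap": 0, "damaging": True},  # cis
]

def compound_het_genes(variants):
    """Genes where damaging variants land on both haplotypes."""
    haps_hit = {}
    for v in variants:
        if v["damaging"]:
            haps_hit.setdefault(v["gene"], set()).add(v["hap"])
    return sorted(g for g, haps in haps_hit.items() if len(haps) == 2)

print(compound_het_genes(variants))  # only GENE_A has both copies hit
```

Without phase, GENE_A and GENE_B look identical (two damaging hets each); the haplotype labels are what separates them.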
Long-term, Complete Genomics is a business. I'm not a fan of staring at stock charts (and certainly don't believe they offer any insights into future performance), but Complete's has been rough. Most of this year, they've been below 5 and lately near 2. The announcement provided some boost, but long-term the question remains whether the market exists for factory-scale human genome sequencing -- or if it will develop quickly enough to sustain Complete. Complete's heavy investment in idiosyncratic sequencing technology may pay off, or it may simply raise their costs in a way that gives lighter competitors an edge. I've always felt Complete's focus on human genomes was operationally brilliant; whether it works as a business remains to be seen. The LFR process appears to add value without substantially raising their cost, but will customers flock towards Complete and away from competitors? Only time will tell.
There are a lot of cool innovations in the molecular biology, none of them unprecedented but the combination is clever. Complete must feel they can safely disclose the system due to patent protection (it should be noted that the paper cites 2 patents/patent applications, or about 2 more than just about any other Nature paper). Good for them, but I guess the rest of us can just pine for the LFR pieces, which could be pretty darn useful in any de novo genomics effort.