After introductions and teasing out past collisions, I gave a thumbnail sketch of what I had synthesized from available sources, and they pointed out two major misconceptions I (and others) have been propagating (well, I think the one mostly me; maybe both mostly me). While it is certainly true that their technology is an entry in the synthetic long read game, they do not attempt to reconstruct each fragment which goes into their system. In this way they are closer to Complete Genomics' Long Fragment Read (LFR) technology than Illumina's Moleculo (officially TruSeq Synthetic Long-Read, but who can let a Conan O'Brien neologism die easily?). An advantage of this approach is that it sidesteps compounded oversampling: truly reassembling each fragment requires oversampling it, and getting sufficient coverage of the genome as a whole requires oversampling at the fragment level. Oversampling times oversampling means really large oversampling; if one shoots for 20X read coverage per fragment and 20X fragment coverage of the genome, that's 400X overall coverage one must pay for. Instead, they shoot for a more typical 20X-ish coverage; the 10X system is simply providing reads (with a very low PCR duplicate rate of 1% from 1ng input, significantly lower than a TruSeq library prep with 50ng input, according to their data) which carry additional information.
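To make that arithmetic concrete, here is a trivial back-of-envelope sketch (Python purely for illustration; the 20X figures are the hypothetical ones above, not 10X's specifications):

```python
# Why fully reassembling every fragment is expensive: the two levels of
# oversampling multiply. Numbers are the illustrative ones from the text.
reads_per_fragment = 20    # depth needed to reassemble a single fragment
fragments_per_locus = 20   # fragment-level coverage of the genome
total_burden = reads_per_fragment * fragments_per_locus
print(f"{total_burden}X total sequencing to pay for")  # 400X

# Tagging reads instead of reassembling fragments keeps the burden at a
# conventional whole-genome depth (~20X).
```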
Software: Loupe
Instead, each read is tagged as to which compartment (with over 100K compartments and barcodes in the system) it arose from. Using their open source pipeline, called Martian, this information can be used to generate phased haplotype information for human genomes, starting with the BCL files output by the sequencer. This is an important point: the initial commercial system will be for human genome resequencing only, though they have active collaborations exploring uses for de novo assembly, microbiomes and other applications that could leverage the technology. A new multi-platform (Mac, Windows, Linux) genome browser developed by 10X, Loupe, allows scientists to see the level of evidence for the haplotype information. [Figure: views from Loupe, showing correlation between tags.] Data is provided as phased VCFs, BAM files with phasing tags and other standard formats. The entire package, save a critical Illumina-supplied tool and the Loupe browser itself, is open source. The pipeline can run on a variety of Linux flavors, and can also utilize load sharing systems such as LSF or SGE.
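Because the deliverables are standard formats, downstream tooling needs nothing exotic. As a minimal sketch of consuming a phased VCF (assuming the usual VCF conventions of "|"-phased genotypes and a PS phase-set FORMAT field, which are spec-level details rather than anything 10X-specific; file and sample names are hypothetical):

```python
import pysam  # assumes pysam is installed

def phased_het_sites(vcf_path, sample):
    """Yield (chrom, pos, alleles, phase_set) for phased sites in one sample.

    Consecutive sites sharing a PS value belong to the same haplotype block.
    """
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf:
            call = rec.samples[sample]
            if call.phased and "PS" in call:
                yield rec.chrom, rec.pos, call.alleles, call["PS"]

# e.g. for chrom, pos, alleles, ps in phased_het_sites("phased.vcf.gz", "NA12878"): ...
```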
The 10X team pointed out that this haplotype information can resolve many complex repeats in the human genome, though really hairy long runs of short tandem repeats (STRs) are a pathological case (as noted by the Broad Institute's David Jaffe, who had access to a rough cut of the GemCode system, in another AGBT talk). This can enable resolving very complex structural variants. %GC extremes are not a problem, beyond the issues which Illumina sequencing can itself experience with them. As shown below, phased reads can be used for very convincing detection of structural variants.
One intriguing application of the technology is to add phasing information to exome capture. Libraries prepared on the instrument look like any other Illumina library, save some additional barcoding. By simply including a degenerate blocking oligonucleotide, which can be ordered from any supplier, 10X-prepared libraries can be run through standard hybridization capture protocols. In this way, on-target reads which came from the same original fragment can be identified. Unusual linkages can then be detected, meaning gene fusions with intronic breakpoints can be picked up at the genomic level, so long as the input fragments are long enough to span the intron, as seen below. Levi Garraway (also with the Broad) showed similar detection of rearrangements using exon capture data from prostate cancer (and I learned a new word, chromoplexy, which is a genomic catastrophe mode similar to but distinct from chromothripsis). 10X also pointed out that tagged reads can be a powerful way to filter out mismappings: just as having a paired end can assist with detecting wrong mappings, one now has multiple reads to support or undermine a given mapping.
10X has shown this for my old friend, the highly clinically relevant fusion EML4-ALK seen in lung adenocarcinomas and some other solid tumors. While the wild type EML4 and ALK genes are separated by about 13Mb, an inversion brings these close enough to generate linked reads in the 10X system, as shown in the figures below. Since no one has yet commercialized a long fragment capture protocol, 10X may have a unique offering.
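A hedged sketch of the underlying idea: two loci ~13Mb apart should share almost no compartment barcodes by chance, so an excess of shared barcodes suggests they sat on the same long input fragments, i.e. a rearrangement. This assumes the compartment barcode is carried in a per-read BAM tag (here hypothetically "BX"); the file name is made up and the coordinates are approximate hg19 positions for EML4 and ALK, not anything from 10X's materials.

```python
import pysam

def region_barcodes(bam_path, chrom, start, end, tag="BX"):
    """Collect compartment barcodes seen in reads over one locus."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return {read.get_tag(tag) for read in bam.fetch(chrom, start, end)
                if read.has_tag(tag)}

# An excess of shared barcodes between the two loci flags a candidate fusion:
# shared = region_barcodes("tumor.bam", "chr2", 42396490, 42559688) & \
#          region_barcodes("tumor.bam", "chr2", 29415640, 30144432)
# print(len(shared))
```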
Wetware
To summarize the 10X experimental workflow, their $75K box takes proprietary microfluidic cards, which the user loads with reagents and unsheared DNA for eight samples. Remarkably (and an important difference from Moleculo), the system requires only 1 nanogram (5 nanocarats?) of input material, in contrast to 500 nanograms for Moleculo. The device runs for five minutes to partition the material into the barcoded gel beads ("GEMs"), and the material is then ready to go into a standard thermocycler to run the first half of 10X's chemistry. This process can be repeated until the desired number of samples is ready to go.
Their proprietary biochemistry is activated using a standard thermocycler. The library construction, contrary to my previous misconception, is not transposase-based, though no further details were given (perhaps it is along the lines of LFR?). Within the GEMs, a limited, linear amplification of the input material (vs. MDA in LFR and PCR for Moleculo) occurs; 10X believes this generates less distortion than the competing approaches. All this results in fragments which are tagged on one end with the P5 primer and compartment barcode.
After a simple, one-reagent emulsion-breaking step, the remainder of the protocol largely resembles a typical library prep workflow, with end repair, A-tailing and ligation of the P7 primer, followed by PCR-based addition of sample barcoding tags. Then off to the sequencer! This prep process is estimated to be "a long day" for 96 samples -- perhaps an hour and a half of running 5-minute, 8-sample batches on the 10X instrument, followed by the rest of the workflow.
What sort of DNA can go into the workflow? 10X emphasized that DNA from standard prep kits will work just fine and yield a lot of additional information, since the fragment length will be in the neighborhood of 50 kilobases. If you want to use more gentle preps, perhaps even digestion in plugs to give multi-hundred kilobase fragments (as my team did in a project with BioNano Genomics), then you'll get correspondingly longer linkage information. But even with standard DNA preps, haplotype block N50s from 23X whole genome sequencing are on the order of 1-2 Mb, with the extreme blocks up around 10Mb in length. Short switch error rates (in which a few SNPs are wrongly phased) are on the order of 0.3%, with long switch error rates (in which stretches of SNPs are mis-assigned) around 0.01%.
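For reference, the N50 quoted here is the standard statistic: half of all phased sequence lies in blocks at least this long. A minimal sketch (example lengths are invented):

```python
def n50(block_lengths):
    """Length at which half of all phased bases lie in blocks this long or longer."""
    total = sum(block_lengths)
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# e.g. n50([2_000_000, 1_800_000, 600_000, 150_000]) -> 1_800_000
```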
Okay, that all sounds wonderful. What's it going to cost? In addition to the $75K instrument cost, each library will run about $500. That's not a typo -- 10X sees the current library reagent market as a race-to-the-bottom cutthroat commodity market which is unappetizing, particularly to a venture-funded startup. On ScienceExchange.com, there are at least teaser prices for exome sequencing for less than that, so getting phased exomes could easily double the cost. Genomes on Illumina's X10 platform have a raw cost of $1K (yes, yes, I know there are a lot of costs that number doesn't include; I'm not versed in what honest estimates for those subsidiary costs -- storage, analysis, etc. -- are); even if library prep there is $50, doing a 10X library on the X10 (just talking about this stuff involves inverted repeats!) is just shy of a 50% increase. Still, that would seem less expensive than the heavy oversampling required for a high-coverage Moleculo genome, and certainly less than a high-coverage PacBio genome (my current estimate -- very rough -- is about $40K for PacBio with the current chemistry). So if you want direct high resolution haplotypes, this may be the least expensive option currently on the table. 10X emphasizes that haplotypes are computed only from the given data; no imputation is performed from known SNP phasings.
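Checking my arithmetic on that (all figures as quoted above, rounded; a sanity check rather than a pricing model):

```python
x_ten_genome = 1000      # rough raw cost per genome on the X10, $
standard_prep = 50       # assumed conventional library prep, $
gemcode_library = 500    # quoted 10X library cost on top, $

base = x_ten_genome + standard_prep
print(f"~{gemcode_library / base:.0%} increase")  # ~48%, "just shy of 50%"
```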
So the 24 karat question is to what degree the genomics community will decide that the added clarity of haplotype information is worth the cut it takes out of the number of genomes which can be sequenced for a given budget. I'm too much of an armchair human genomicist to give a good answer. I could certainly imagine applying this in research settings, particularly to establish haplotype information for a diverse set of human samples or for standards such as Genome In A Bottle. Maybe groups studying chromothripsis in cancer (or, in a stunning recent story, non-cancer tissue) would take it up. Perhaps it would help with really puzzling clinical cases. But will phased exomes take off if that means running half as many exomes? Will some groups run 10X routinely, or will it be more of a second-round approach when standard sequencing fails to give a clear picture? In short, will 10X be the class ring worn daily and ubiquitously, or the family treasures brought out of safekeeping only for special occasions?
A widely tweeted talk by David Page, which I've attempted to summarize using Storify, concerns the extremely repetitive nature of mammalian Y-chromosomes, which contain many amplicons of varying sizes. Page has developed a method he calls SHIMS, which involves making BAC libraries, cloning out Y-chromosome BACs and sequencing on the MiSeq. In this way, he gets haplotype-resolved information on some beastly repeats, such as a 500kb repeat present in about 180 copies. I can't help thinking this would be an acid test for 10X or similar technologies: can you resolve regions when a 105kb overlap has only 11 nucleotide substitutions?
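To put that challenge in numbers (figures from Page's talk as described above):

```python
# How sparse are the distinguishing differences in that amplicon overlap?
substitutions, overlap_bp = 11, 105_000
print(f"{substitutions / overlap_bp:.4%} divergence")              # ~0.0105%
print(f"one substitution per {overlap_bp // substitutions:,} bp")  # 9,545 bp
```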
Given that big venture kitty, I would expect 10X to expand their offerings once the first wave of customers shakes out any remaining hiccups in the system (though 10X has generated libraries from over 4K samples at this point). Adapting to other organisms should be largely a case of importing new references. While important research mammals such as mice may be an area of interest, I would expect a major market to be in plant genomes, particularly for agricultural breeding purposes (and perhaps also in livestock). Plants also bring in the fun of tetraploidy, hexaploidy, octaploidy and such.
My own heart would lie more in the direction of de novo assembly, but that will require some additional software development. Jaffe's talk discussed using 10X in the context of de novo assembly of human cancer genomes, with the DISCOVAR assembler taking advantage of the additional information (the available tweets don't offer much higher resolution than that statement). It would be great to see some of the major assembly/scaffolding bioinformatics groups join in as well, which would best be stimulated by 10X or partners making available datasets for some genomes (perhaps ones used in Assemblathon?).
Library prep technology could be seen as a bit of a theme to this year's AGBT. Tomorrow should bring another post on a specific technology, and soon after the meeting ends I'll post some broader thoughts on the space as a whole.
(2015-03-03 -- some corrections made to comments about software -- Loupe is not open source)
This video of sorting out haplotypes is getting raves on Vimeo.
What is the cost to sequence a phased human genome to a quality similar to that achieved by 30X Illumina coverage?
10X's device simply adds value to the library (minus, perhaps, some read length burned on the inline barcodes), so the answer to your question should be just the added cost of 10X library prep vs. whatever library prep was used before.
I have a hard time seeing this applied everywhere. Unless it can actually resolve long STRs, it's a niche application and most of us will wait for true long reads.
As always, thanks for the great write-up. This was much more informative than their presentation!
Oh, and library prep on an exome/genome runs about $20 in raw reagent costs