Mate pair approaches are the oldest high-throughput method. In these, sheared, size-selected DNA is tagged with adapters, circularized, and then the DNA flanking the circularization junction is made available to sequence. Mate pairs suffer from often labor-intensive workflows, contamination with both non-junction reads and false junctions from inter-molecular (rather than intra-molecular) ligations, and a limited ability to resolve truly complex repeats. For example, if both ends of the mate pair map to perfectly repeated sequence, then that mate pair is nearly (or completely) useless. Conventional mate pair libraries are also limited in the size range of the input fragments; I think few reliable schemes have been published for fragment sizes even in the 20Kb range (or at least, those that claim to be reliable have few follow-on publications). There is a clever approach from Lucigen to sequence the ends of a fosmid library, but it has not seen widespread use. Mate pair approaches also require large amounts of input DNA, both because they are multiply inefficient with input (much of the fragmented DNA is the wrong size, and only the ends of correct fragments end up in the library) and because many of the manipulations need to be run in relatively large volumes. However, mate pairs do leverage the spectacular economics of generating data on short read sequencers.
The obvious alternative is to just sequence long molecules directly. PacBio and Oxford Nanopore are the two commercially available technologies for this. PacBio has now been used for multiple, very complex and very large genomes, including human. Oxford just announced a human genome dataset, though at relatively low coverage and with no statistics yet on what information can be extracted from it. However, both of these technologies suffer from high error rates and higher costs per base than Illumina. While clever (and continually improving) informatics approaches enable these reads to be used for assembly and structural variant calling, for some applications they may be unable to stand on their own.
Linked reads are a class of technology in which multiple reads are grouped, and potentially ordered, into sets known to originate from individual input molecules. An early such technology was Complete Genomics' Long Fragment Read technology, though this was only ever available as part of Complete's human genome sequencing service. Moleculo, a startup acquired by Illumina and then marketed as TruSeq Synthetic Long Reads, attempts to reconstruct individual input molecules in the 5-10Kb range via a process involving PCR and high sampling in the sequencing stage. Illumina, in collaboration with Jay Shendure's group, developed another method called Contiguity Preserving Sequencing, which leverages the Nextera transposase technology, but Illumina never marketed this approach.
Probably the best known linked read technology is 10X Genomics', which first packages DNA into barcoded microdroplets; each input molecule is then converted to a set of fragments bearing its droplet's barcode. After emulsion breaking, some conventional preparation steps are used to complete the library. A large number of barcodes (as many as 1 million) are used, with each DNA molecule converted to a small number of fragments (a low sampling rate). 10X supplies software for haplotype phasing from this information, as well as a de novo assembler called SuperNova. Recent improvements in SuperNova have lowered the recommended memory to only 384GB of RAM.
If you understand 10X's offering, then you'll quickly understand the approach of iGenomX, but that's not to say it's some simple knock-off. iGenomX differs in many details, with the potential for very different performance characteristics.
As an aside, while iGenomX is targeting linked reads via microfluidic encapsulation, the core of the process is both simple and elegant, leading me to wonder if it could be adapted to rapid, simpler, lower-cost library preparation for individual samples. How simple? I'm reminded of a quote from Aykroyd: "no scaling, gutting or cutting" -- or in this case, no fragmentation, end repair, A-tailing or ligation. The whole process is also automation-friendly. It is also a reminder that there is still a lot of room to innovate in short read library preparation.
iGenomX starts with 1ng of DNA, as long as possible (as I've noted at the beginning of the year and just recently, DNA prep, particularly for long fragments, is ripe for innovation). The DNA is encapsulated into picoliter droplets and merged with a library of barcoded primers using RainDance Technologies' ThunderStorm instrument. This $250K instrument processes 96 samples in an overnight run. The primers (to be explained shortly) are contained in smaller droplets than the large droplets containing both input DNA and master mix. A disposable microfluidic device is used here, which might be the most expensive consumable in the process.
Inside the droplets, the barcoded primers containing a partial P5 sequence land on the long DNA fragments via randomized eight-basepair sequences at their 3' ends. These are extended along the input fragment by a thermostable, non-displacing polymerase, until extension is ended by biotinylated terminator nucleotides. All four terminators are present in each reaction, with the ratio of natural nucleotides to terminators determining the length of the extension. iGenomX commented to me that this ratio could be tweaked to tune the amplification for G+C content, and that the primer designs also offer opportunities for tuning. The in-droplet reactions can be cycled to generate sets of fragments from the original molecule. Note that the input molecule is never fragmented, so it is possible to completely cover it with copies. Since the copied molecules are terminated, they cannot prime elsewhere on the input; hence, few of the created fragments will be copies of copies.
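As a back-of-envelope illustration of how that ratio sets fragment length (the numbers and the uniform-incorporation assumption here are mine, not iGenomX's): if a fraction p of the nucleotide pool is terminator, each incorporation ends extension with probability roughly p, so extension lengths are approximately geometric with mean about 1/p.

```python
# Toy calculation, not iGenomX's actual parameters: with terminator
# fraction p per incorporation, extension length is roughly geometric
# with mean ~1/p (assuming uniform incorporation of all bases).

def expected_fragment_length(terminator_fraction: float) -> float:
    """Mean number of bases extended before a terminator is incorporated."""
    return 1.0 / terminator_fraction

# A 1:999 terminator:natural mix (p = 0.001) gives ~1Kb fragments
print(expected_fragment_length(0.001))  # -> 1000.0
```

In this simple picture, tuning for G+C content amounts to adjusting the effective p for each base's terminator separately.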
Next, the emulsions are broken to release the DNA. Newly created fragments are captured on biotinylated beads, with all of the original DNA and any unterminated fragments washed away. A P7-bearing primer is added via the same random-priming approach, except this time a strand-displacing polymerase is used. Multiple primers may sit down on the terminated fragments from step one, but only the longest will still be bound after extension; those that bind closer to the 3' end will be evicted by the strand-displacing polymerase (the figure below illustrates this prior to displacement). Washing the beads removes these shorter fragments. PCR directly on the beads serves to release the library molecules and amplify them, as well as complete the P5 and P7 sequences and add a sample-specific non-inline barcode to the P7 side. After size selection (currently with Pippin, but SPRI beads should work as well), the library is ready to go on the sequencer.
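The eviction logic can be caricatured in a few lines (purely my simplification; the real selection happens on the beads via polymerase displacement): among second-strand products primed at different offsets along one terminated template, only the longest remains bound after extension.

```python
# Toy model (my simplification) of the displacement step: second-strand
# products are primed at various offsets along a terminated template;
# the strand-displacing polymerase peels off all but the longest.

def surviving_product_length(primer_offsets, template_length):
    """Length of the one product left bound after strand displacement."""
    return max(template_length - offset for offset in primer_offsets)

print(surviving_product_length([0, 120, 430], 1000))  # -> 1000
```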
A key difference between how iGenomX is tuned versus 10X is in the coverage of each molecule. As noted above, 10X uses a very large number of barcodes (up to 1 million) but samples each molecule relatively sparsely. In contrast, iGenomX uses a small number of barcodes but aims to sample the molecules deeply. In the datasets they will be releasing, iGenomX used only 1544 barcodes. This will soon increase to 15K, and they see no barrier to making a higher diversity of barcodes. But conversely, at this time iGenomX doesn't see a need, as they are happy getting many reads from the same molecule. At this time their pipeline, which will be released open source and simply leverages existing tools such as Picard, BWA and IGV, does not try to assemble the reads by barcode, but this is certainly a possibility. Nor does the current pipeline perform barcode-aware alignment or align to a graph representation of genomes. Clearly there are opportunities for further improvement.
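To see why a small barcode set can suffice, a quick calculation (the molecule count M is my made-up assumption; only the 1544-barcode figure is from iGenomX): with M input molecules spread over B barcodes, each barcode is expected to tag roughly M/B molecules, and deep per-molecule sampling lets co-barcoded molecules be told apart afterwards by genomic position.

```python
# Rough illustration with a hypothetical molecule count: the expected
# number of molecules sharing each barcode is simply M/B.

def molecules_per_barcode(n_molecules: int, n_barcodes: int) -> float:
    return n_molecules / n_barcodes

print(molecules_per_barcode(100_000, 1544))  # roughly 65 molecules per barcode
```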
A point which iGenomX's Keith Brown stressed to me throughout our telephone conversation is that their team has aimed for highly uniform coverage within each droplet, which they believe is a product of their chemistry and also potentially tunable further. In the alignment plots below, the uniformity of the coverage is reflected in the rarely changing slope of the path formed by the successive fragments.
At an ASHG workshop today, iGenomX and collaborators are describing a number of datasets. Statistics on iGenomX's sequencing of the NA12878 human genome sample show that at 28X mean coverage, over 91% of the genome was covered at 10X or higher.
On this highly-studied, Genome In A Bottle sample, the concordance for SNP and small indel calling with GIAB results was very high.
For haplotype phasing, iGenomX found a short switch rate of 0.0002% and a long switch rate of 0.0225%, with a haplotype block N50 of 16Mb.
Launch and Possible Applications
iGenomX is launching the product today via agreements with two of the test sites, the HudsonAlpha Institute for Biotechnology and the Scripps Institute. I don't have pricing right now, but you can bet I will soon (the only proper way to understand these things is to try them out!). Shawn Levy from HudsonAlpha will be one of the presenters at the workshop, along with Nicholas Schork from the J. Craig Venter Institute. The libraries will run on any Illumina sequencer, including the X5/X10 family.
I'm not close enough to the human genomics scene to understand the degree to which 10X is or is not satisfying researchers' needs in the haplotyping field. Brown did comment that while he couldn't give me specific pricing, the actual cost of goods for the system is very low, because there are so few enzymes in the system. He expressed a hope of not just matching the cost of a conventional library prep, but perhaps even beating it. If this were the case, then it would presumably remove a major obstacle to generating haplotype-resolved genome sequences. If, as widely reported, 10X adds about 50% to the cost of generating a genome on the Illumina X10, then iGenomX might bring the premium down to a pittance.
10X has looked interesting for other applications such as large genome de novo sequencing and metagenomics, but since the technology is still quite new, these haven't yet hit publication (though I could have missed them). As mentioned above, one nuisance is the memory-hungry linked-read-aware de Bruijn assembler SuperNova. I had written this spring of the scarcity of cloud instances (I had missed offerings from Microsoft) capable of supporting this beast; Amazon now has one, but the meter spins at about $13 per hour. In contrast, one could imagine tuning the iGenomX prep to very deeply sample each input molecule, enabling pre-assembly of barcode pools before any grand assembly -- or perhaps just stopping more conservatively with the pre-assemblies. Since each pool is small, any conventional short read assembler (such as SPAdes) should work fine on a typical compute node. Such pre-assembled reads could then be fed as long reads into SPAdes or Canu or whatever other favorite assembler you have. None of these are yet barcode-aware, so linkages between fragments from the first-round assembly won't be exploited, but it could be a very memory-efficient and practical assembly strategy. I would also see such an approach as attractive in sequencing highly heterogeneous cancer samples, avoiding mixing information that may have come from different cells (or at least controlling that mixing).
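The per-pool bookkeeping such a strategy needs is simple; here is a minimal sketch (my own illustration, not part of iGenomX's pipeline) that splits reads by barcode into per-pool FASTA text, each pool then being small enough to hand off to a conventional assembler such as SPAdes.

```python
from collections import defaultdict

# Minimal sketch (not iGenomX's pipeline): bucket reads by droplet
# barcode and emit one FASTA-formatted string per pool, ready to be
# written out and assembled independently.

def pools_to_fasta(reads):
    """reads: iterable of (barcode, read_id, sequence) tuples."""
    pools = defaultdict(list)
    for barcode, read_id, seq in reads:
        pools[barcode].append(f">{read_id}\n{seq}")
    return {bc: "\n".join(records) for bc, records in pools.items()}

reads = [("AACG", "r1", "ACGTACGT"), ("AACG", "r2", "TTGGCCAA"),
         ("GGTT", "r3", "GATTACA")]
fastas = pools_to_fasta(reads)
print(sorted(fastas))  # -> ['AACG', 'GGTT']
```

Because each pool is assembled in isolation, the peak memory footprint is set by the largest pool rather than the whole genome.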
iGenomX is releasing a small number of datasets. There are a lot of interesting applications and a lot of pressure-testing questions to ask within the existing applications, and Brown was quite upfront that iGenomX hasn't asked most of these questions or tried most of these applications. Rather than wait for the company to get around to them, they are pushing the technology out to a few core labs which can start generating data for interested investigators.
What if you want your own? I'd say the economics of the system strongly bias it towards very large core labs. The ThunderStorm is a $250K instrument, and that price gets you a large, complex robot taking up a lot of bench space and with many moving parts (I didn't think to ask how much of the process can be automated on the ThunderStorm). It will process 96 samples a day, with the overall throughput of one FTE rated by iGenomX at 384 samples per week. That works out to Illumina X10 sort of throughput. Even if you process only one batch of 96 per week, that's still a pace of about 4500 genomes per year, which is quite a run rate (and one which must be paid for). So unless you're cranking out lots of sequences, or just can never tolerate waiting in a queue, you should just use one of the cores that are getting the setup (and there is no shortage of core lab Illumina capacity; James Hadfield has a nice piece on this today) -- and since no other gear is required, it would be surprising if any core that has a ThunderStorm doesn't sign on to provide iGenomX services.
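Checking the arithmetic behind that run rate (the 47-working-week figure is my assumption, not anyone's quoted number):

```python
# Sanity check of the quoted throughput: one 96-sample batch per week,
# assuming roughly 47 working weeks per year (my assumption), lands
# near the "about 4500 genomes per year" pace.

SAMPLES_PER_BATCH = 96
WORKING_WEEKS_PER_YEAR = 47

print(SAMPLES_PER_BATCH * WORKING_WEEKS_PER_YEAR)  # -> 4512
```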
In any case, what exciting times for long reads! PacBio continues to drop costs, narrowing the gap with Illumina. Oxford Nanopore keeps upping the performance of the MinION, enabling 30X human genome coverage to be generated using only 14 flowcells, and at some point hopes to have the bigger PromethION online. Genia's chemistry looks about settled, as I reported in my previous dispatch. 10X has been gaining traction. An increasing number of publications demonstrating superior results from long reads, particularly on the PacBio platform but also on the other launched platforms, builds further interest and will ultimately make short-read-only genome sequences recognized as undesirably inferior. That transition in opinion may be a year or two off, but as the cost gap closes, fewer researchers will be resistant to the lure of higher quality genomes and haplotypes.