Sunday, January 18, 2015

JPM Wrap-Up:

In this final installment of a series of reactions to news coming from the J.P. Morgan Conference, I'll cover an interesting complementary technology that was announced.  But first, it might appear my prediction of no radical sequencer announcements may have been invalidated, with an announcement from BGI of plans to launch two sequencers based on Complete Genomics' technology.  Unfortunately, the only outlet that seems to have covered this is GenomeWeb, and it is in their premium (paywalled) section, so I know nothing beyond that.  It appears this was only announced around JPM and not at JPM, so I have a Clintonesque out as well.

Previously stealthy 10X Genomics announced both a mongo financing round ($55.5M) and a partial description of their proposed product in the long read space.  10X appears to be a microfluidic cousin of Moleculo, partitioning the input sample into many droplets and then generating a sequencing library from each droplet.  Enormous numbers of barcodes mean most if not all of these droplets will have unique barcodes.  Input DNA requirement is projected at a minuscule 1ng, but the input molecules can be as long as 100kb or more (strongly suggesting that any amplification step uses MDA or a similar technology rather than long PCR).  After sequencing on the Illumina platform, software from 10X sorts the reads to partitions and assembles each partition's data into long synthetic reads.
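The core informatic step here -- sorting reads back to their droplet of origin by barcode before per-partition assembly -- can be illustrated with a toy sketch.  This is not 10X's actual software; the read and barcode representations are invented for illustration, and a real assembler would replace the placeholder at the end.

```python
from collections import defaultdict

# Toy reads as (barcode, sequence) pairs; in the real workflow the
# barcode is a short tag sequenced alongside each read.
reads = [
    ("AACG", "ACGTACGTAC"),
    ("AACG", "GTACGTACGG"),
    ("TTGC", "CCGGTTAACC"),
    ("AACG", "TACGGATCCA"),
    ("TTGC", "TTAACCGGAA"),
]

def partition_by_barcode(reads):
    """Group reads sharing a droplet barcode into one partition."""
    partitions = defaultdict(list)
    for barcode, seq in reads:
        partitions[barcode].append(seq)
    return dict(partitions)

partitions = partition_by_barcode(reads)
for barcode, seqs in sorted(partitions.items()):
    # Each partition's reads would next be assembled into a synthetic
    # long read; printing a summary stands in for that step here.
    print(barcode, len(seqs), "reads")
```

Because each partition holds reads from only one (or a few) long input molecules, the assembly problem within a partition is vastly simpler than assembling the whole genome at once.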

It is worth reviewing the theoretical strengths and weaknesses of synthetic long reads. The goal of these partitioning strategies is to split high-complexity input genomic material into many smaller genomes of much lower complexity.  In all cases, this enables the resolution of repeats if these subgenomes are sufficiently simple.  10X suggests that their compartments might contain a single DNA molecule, which would be as simple as possible -- but perhaps not always simple enough.  Such a reduction will enable assembling through long interspersed repeats, but not if multiple such repeats occur within a partition.  So, for example, if bacterial ribosomal RNA genes are dispersed throughout the genome with a spacing greater than the fragment size, they could be resolved.  On the other hand, if a fragment contains two or more copies of such a large repeat, then that fragment may not be assemblable on its own, though a clever algorithm might be able to leverage multiple such fragments (with different endpoints) to resolve the structure.  A long trinucleotide repeat represents a pathological case for synthetic reads, though again some clever informatics might provide reasonable bounds estimated from copy numbers.
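The spacing argument above can be made concrete with a small simulation: if repeat copies are spaced farther apart than the fragment length, no fragment ever contains two copies, while tighter spacing means essentially every fragment does.  The genome size, spacings, and fragment length below are illustrative numbers, not measured values.

```python
import random

def fragments_with_multiple_repeats(genome_len, repeat_positions,
                                    frag_len, n_trials=10000, seed=0):
    """Estimate the fraction of random fragments spanning two or more
    repeat copies -- the case a single partition cannot resolve."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        start = rng.randrange(genome_len - frag_len)
        end = start + frag_len
        copies = sum(start <= p < end for p in repeat_positions)
        if copies >= 2:
            hits += 1
    return hits / n_trials

# Repeats spaced 150kb apart: no 100kb fragment can hold two copies.
spaced = list(range(0, 1_000_000, 150_000))
print(fragments_with_multiple_repeats(1_000_000, spaced, 100_000))  # 0.0

# Repeats spaced 40kb apart: every 100kb fragment spans several copies.
clustered = list(range(0, 1_000_000, 40_000))
print(fragments_with_multiple_repeats(1_000_000, clustered, 100_000))
```

The first case is the rRNA-operon scenario where partitioning wins; the second is the closely-spaced-repeat scenario where synthetic reads hit their limit.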

The gain from all the extra work, as well as some extra oversampling of the input material, can be manifold.  Synthetic long reads can potentially assemble through repeats that cannot be resolved by short reads, with the caveat noted above. This is also useful for resolving structural variations, such as duplications, in already sequenced genomes. For diploid (or higher-ploidy) genomes, synthetic reads offer the hope of resolving haplotypes.  In metagenomes, synthetic long reads offer the possibility of reading out sizable fragments of individual genomes as well as resolving closely related versions of dominant species.

Actual long reads from platforms such as Pacific Biosciences and Oxford Nanopore will remain superior for repeat resolution, as they do not have problems with closely spaced repeats; each repeat will be represented individually in the sequence.  In the extreme case, these systems should be able to (and in the case of PacBio have been demonstrated to) accurately resolve the number of copies of simple tandem repeats.  The potential advantages of synthetic long reads are the inherently higher accuracy of the underlying data and the enormous throughput of the Illumina platform.  For metagenomics, as an example, PacBio can provide high quality sequences of long segments only if the same region can be multiply sampled, whereas the 10X system should generate a high quality synthetic read from nearly every fragment that is sequenced.

While a number of synthetic long read systems have been published (such as Complete Genomics' Long Fragment Read approach), Moleculo (now TruSeq Synthetic Long Reads) is the primary one that is commercially available.  10X will require purchase of a ~$90K microfluidic appliance, whereas Moleculo uses only plasticware.  But 10X is promising the vision of 100Kb synthetic long reads, whereas Moleculo works in the range of 5-10Kb.

Illumina and collaborators published a clever approach to scaffolding this fall, which leverages a previously unappreciated tenacity of the Nextera transposase for DNA.  Because the transposase actually holds onto, and holds together, the DNA into which it inserts, dilution of DNA post-transposition enables a subpartitioning of the input genome or metagenome.  However, because this approach involves no amplification, only a fraction of each input fragment will be represented in the final data.  So this approach yields long range information (up to about 100Kb), but because it does not deliver contiguous sequence, the resulting data can be used only for scaffolding.

Can 10X deliver a product returning on that $55M investment?  One question is how nicely Illumina plays with them -- does the dominant player see 10X as a boost to their platform or a drain on reagent sales?  While overly aggressive moves might bring on the dreaded spectre of antitrust issues, simply not supporting 10X on the higher end platforms could make life difficult for the smaller player.  Even without interference from Illumina, the market value for synthetic long reads has not yet been established.  Scientifically, long reads are great, but will large scale sequencing projects decide to spend the extra for this add-on?  Will the 10X appliance fit into megaprojects' scaling plans? Conversely, perhaps 10X will mostly eat into the forward motion of mapping players such as BioNano and Nabsys; why map if you can sequence?  A successful 10X launch would probably doom to niche status the cosmid-based long read technology from Lucigen, which is clever but requires making cosmid libraries (which in turn require relatively large amounts of input DNA).

The ultimate version of this convergence would be an all-Illumina strategy for highly contiguous complex genomes, with 10X providing 100Kb reads and Hi-C-like techniques scaffolding those into chromosome arms or complete chromosomes.  Since strategies along these lines were mentioned frequently in the Twitter stream from the recent Plant and Animal Genomes meeting, that vision may dominate de novo sequencing for the next few years.


James@cancer said...

Hi Keith, Moleculo-esque ideas were part of a post I wrote around AGBT last year. Other people might come into this space over time if 10X make some headway - and I think there is lots of room for competition.


Keith Robison said...

James: Thanks, but your bitly doesn't seem to quite work -- I believe you are referring to this post Genome partitioning: my moleculo-esque idea. Your comment on cancer genomes is spot on -- every cancer genome really should be a de novo project. I think you've outlined what is probably how 10X works; I've toyed with the ideas for the last 3 years but without the means to execute on them. GnuBio would have been best positioned to do this & perhaps wishes they had gone down this path, given they were bought out for only about twice the size of 10X's funding round, and this is an easier instrument to build (but at a higher sticker price) than what Gnu was building. That is where it gets tricky: the microfluidic merging of droplets.