Sunday, January 26, 2020

UST Bets on TELL-Seq

I've made a few references recently to TELL-Seq, both in my flawed analysis of BioNano Genomics (I missed a key business development in their raising $18M in October; I stand by the science comments and fear that the fund raise buys them about a year of time) and on 10X Genomics discontinuing their genome assay kits.  Now to actually dig into that technology -- a bit late given the preprint came out last fall, but better late than never.  So put on your sunglasses and hoodies, conjure up the image of early television chefs and key up the theme music for The Lone Ranger, because here I go.
Long range DNA information matters in genomics.  There are two general approaches to acquiring such information: either generate information directly on really long DNAs or use clever strategies to encode long range information into short reads.  In the former category we have Oxford Nanopore, Pacific Biosciences and BioNano Genomics; in the clever encoding category comes a long line of technologies starting with jumping libraries, which evolved into mate pairs, with linked reads (aka long fragment reads or LFR) and HiC arriving later.  Jumping and mate pair technologies took huge physical quantities of DNA and converted them into small amounts of long range information because each long input was converted into effectively one bit of long-range information (though they did preserve orientation).  Linked reads were a huge improvement over these in that they extracted much more linkage, albeit losing the orientation information.

But the early LFR technology from Complete Genomics gave modest levels of information and was always restricted to that platform.  As I wrote then, the basic idea is to split high molecular weight genomic DNA into partitions -- Complete used 384 well plates.  Amplification of the DNA by MDA provides many copies.  Libraries from each well are sequenced, and a computational process not unlike traditional linkage mapping is used to infer which fragments are near each other.  The method actually harkens back to a process called HAPPY mapping which was proposed to use isolated single sperm to perform the mapping.

Illumina and academic co-authors published a transposon-based scheme called Contiguity Preserving Transposition, but never commercialized it as a kit. In this method, again one partitions DNA into wells on a 384-well plate, but each well is tagmented with barcoded transposons, one barcode per well.  But tagmentation doesn't actually break up the DNA right away, because the transposase itself holds things together.  So if you then mix and split these wells into new 384 well plates you can get another level of tagging -- there are 384 squared possible partitions.  You could imagine repeating the split-and-mix cycle more times, limited only by your ability to read the compound barcodes.  This approach is used in a number of clever single cell methods.
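
To put numbers on the split-and-mix idea, here is a quick Python sketch.  The function name is my own invention, and it simply assumes every round uses a full 384 well plate with one barcode per well.

```python
def compound_barcodes(wells_per_plate: int, rounds: int) -> int:
    """Distinct compound barcodes after `rounds` rounds of split-and-mix,
    assuming one barcode per well in every round."""
    return wells_per_plate ** rounds


# One, two and three rounds on 384 well plates
for rounds in (1, 2, 3):
    print(rounds, compound_barcodes(384, rounds))
```

Two rounds give 147,456 virtual partitions and a third round would push that past 56 million, which is why the real limit is reading the compound barcode rather than the plate count.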

Another approach is to actually generate huge numbers of physical partitions through microfluidics; this is the approach commercialized initially by 10X Genomics.  In the ideal case one will have libraries made from single molecules, though in practice these schemes generally put more than one molecule in each partition.  iGenomix launched a competing kit, only to have the rug yanked from beneath them when the necessary droplet hardware disappeared as part of BioRad acquiring RainDance.  Which points to the problem: you need an expensive piece of hardware to perform this mode of linked read generation.  Said hardware is also a bottleneck if you want to run many such assays.

The last few years have seen a number of publications (and probably even more patents) of methods for single tube linked read generation.  These all are similar-but-different twists on using DNA-barcoded beads.  Each bead contains only a single barcode, which is somehow linked to library fragments. In some schemes, what's actually on the bead is a transposon carrying the barcode, though as we'll see there's another way to do this.  The trick is producing large numbers of such beads: the original Illumina+U Washington 2017 publication used 150K barcodes.  BGI/MGI/Complete Genomics upped this to 50 million barcodes with their single tube long fragment reads (stLFR) technology, which is offered on their sequencing platforms.

TELL-Seq (Transposase Enzyme Linked Long-read Sequencing), for sale by startup Universal Sequencing Technologies, is another transposons-on-beads scheme.  Barcoded transposons on microbeads add an initial tag to identify the molecule plus one of the Illumina priming sites (Index 1).  A second transposon brings in the Index 2 priming site.  PCR of the library enables a single library-specific tag to be added and read with Index 2.  The entire process takes place in a single tube with only a few reagent additions and a final bead cleanup and is claimed to require 3 hours of wall clock time.

It's interesting to contrast this scheme with the stLFR scheme, which performs a transposition in solution.  Beads with barcodes are added and a ligation tags one end with one type of adapter.  A separate ligation with in-solution oligos gives the other adapter.  I've left out some other steps, such as some digestions to eliminate unused adapters.  In any case, the stLFR authors claim that this approach gives them higher numbers of fragments from the same original DNA molecule.  Interestingly, they believe that a limit they see in overall DNA molecule size may be based on the diameter of their beads and the number of times a given HMW molecule can wrap around.  The TELL-Seq protocol definitely has fewer steps than the stLFR protocol; it would be very interesting to have an independent party run each on the exact same DNA material to see how they perform head-to-head.

TELL-Seq requires a small amount of customization of the Illumina sequencer process, so you probably won't be mixing these samples in with anything else.  For starters, the bead-specific Index 1 barcode is 18 bases long, much longer than a typical 8 or 10 base index barcode.  In accordance with the great principle of Dr. Tanstaafl, that means one or both non-index reads will lose some cycles.  Unfortunately, it doesn't seem possible to also sneak a sample specific index into Index 1, so combinatorial barcoding of samples is not possible in the current framework.  On the NextSeq a supplied custom Index 2 primer must be used in place of Illumina's.

One interesting quirk of TELL-Seq kits is that you adjust for genome size by how many beads you add. Bigger genome, more beads and more partitions.  The supplied protocol divides genomes into three size categories.  So the ~$900 kit is good for 12 microbes but maybe only 4 large vertebrate or plant genomes, and so your library cost will vary accordingly.  In any case, it is a welcome change from 10X, as their protocols were only for large genomes and there really hasn't been much work on adapting them to small genomes.
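
The per-library arithmetic is trivial but worth spelling out.  This is a back-of-envelope sketch using only the two figures quoted above (roughly $900 per kit, 12 microbial or maybe 4 large genomes); I haven't filled in the middle size category because I don't have a number for it, and the function name is just mine.

```python
KIT_PRICE_USD = 900  # approximate kit price quoted above


def cost_per_library(libraries_per_kit: int, kit_price: float = KIT_PRICE_USD) -> float:
    """Reagent cost per library if one kit is spread over this many libraries."""
    return kit_price / libraries_per_kit


print("microbial genome:", cost_per_library(12))  # 12 libraries per kit
print("large genome:", cost_per_library(4))       # "maybe only 4" per kit
```

So roughly $75 per microbial library versus $225 per large genome library, before any sequencing costs.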

For software, there are several options.  TELL-Seq's developers collaborated with Pavel Pevzner and colleagues on cloudSPAdes, a linked-read enabled version of that wonderful package.  There's also 10X's SuperNova and Long Ranger, both used in the TELL-Seq paper.  They also reference Turing Assembler from BioTuring Inc, available as a binary for Linux (apparently free, for the supplied link downloads it).

The TELL-Seq preprint demos performance of TELL-Seq first by assembling with TuringAssembler four different bacterial samples: E.coli DH10B, E.coli MG1655, Campylobacter jejuni and Rhodobacter sphaeroides.

The latter two give some testing of extremes of %GC, with 30.56% and 68.51% respectively -- not quite the GC-evil (<30% & >70%) samples we routinely play with at the strain factory, but not terrible choices.  They also try two different input amounts for DH10B of 0.5 and 0.1 nanograms.  This does illustrate the fact that TELL-Seq demands very low amounts of input DNA.
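
For anyone who wants to check where their own genome falls on that scale, %GC is a one-liner or so.  A minimal sketch, assuming you only want to count called A/C/G/T bases and ignore Ns and other ambiguity codes (the function name is mine, and the toy sequence is made up):

```python
def gc_percent(seq: str) -> float:
    """Percent G+C among the called A/C/G/T bases of a sequence."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("no A/C/G/T bases in sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / called


# Toy sequence: 4 of the 8 called bases are G or C
print(gc_percent("ATGCGCNNAT"))
```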

Table 1 of the preprint details the very good results, though I wish they had broken out separate contig and scaffold contiguity statistics.  Two of the assemblies, MG1655 and C.jejuni, appear to be essentially near perfect -- no Ns, contigs the same general size as the reference.  They also sequenced Coprobacillus cateniformis DSM-15921, a human gut microbiome sample.  This also gave very high contiguity.  However, the R.sphaeroides assembly had many smallish contigs.  One note by the authors is this was commercial column-purified DNA rather than more carefully purified HMW DNA used in the other experiments, and so it wasn't the best test.  Or perhaps it was, showing how TELL-Seq might underperform if you use more typical DNA purification methods.  DH10B proved a bit troublesome as well, as it has a very large duplication.
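
Since I'm grumbling about contig versus scaffold statistics, here is roughly what I mean.  This is a generic sketch, not anything from the preprint: it assumes scaffolds are plain strings in which any run of N marks a gap, and the toy sequences are invented.

```python
import re


def split_scaffold(scaffold):
    """Break a scaffold into contigs at every run of N (gap) characters."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def n50(lengths):
    """Length of the shortest sequence in the smallest set of longest
    sequences that together cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one sequence length")


# Toy assembly (made-up sequences, not data from the preprint)
scaffolds = ["ACGT" * 10 + "N" * 5 + "ACGT" * 5, "ACGT" * 4]
scaffold_lengths = [len(s) for s in scaffolds]
contig_lengths = [len(c) for s in scaffolds for c in split_scaffold(s)]
print("scaffold N50:", n50(scaffold_lengths))
print("contig N50:", n50(contig_lengths))
```

The gap between the two numbers tells you how much of the contiguity comes from joins across Ns rather than from actual sequence.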

The cloudSPAdes paper has three additional TELL-Seq assembly attempts, from Saccharomyces cerevisiae, Staphylococcus aureus and again DH10B.  An interesting comment in the paper is that several other linked reads assemblers, Architect, ARCS and Athena, use SPAdes without barcodes as a first stage but don't fully take advantage of information it generates -- they use barcode information to link contigs but don't use the assembly graph generated by SPAdes.  CloudSPAdes did not do as well on DH10B as TuringAssembler did in the TELL-Seq paper; it would be interesting to compare them on the exact same input dataset. Still, the largest E.coli alignment was 2.5Mb in length.

To demonstrate utility on large genomes, TELL-Seq was applied to two Genome in a Bottle (GIAB) human reference samples, NA12878 and NA24385.  Each sample yielded over 7 million barcodes and 90% of the linked reads indicated source molecules of over 20 kilobases in length and 20-30% over 100 kilobases in length.  Haplotype phasing succeeded for 99.8% of heterozygous SNPs.
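
How does one get a source molecule length out of short reads?  The general trick with linked reads is to group aligned reads by barcode and treat nearby reads with the same barcode as coming from one molecule.  Below is a minimal sketch of that idea; it is not the TELL-Seq pipeline, the input tuples and the 50 kilobase `max_gap` cutoff are my own assumptions, and the toy reads are invented.

```python
from collections import defaultdict


def infer_molecules(reads, max_gap=50_000):
    """Group aligned reads into putative source molecules.

    reads: iterable of (barcode, chrom, start, end) tuples.
    Reads sharing a barcode and chromosome are chained together until the
    gap to the next read exceeds max_gap (a hypothetical cutoff).
    Returns a list of (barcode, chrom, start, end, n_reads).
    """
    by_key = defaultdict(list)
    for barcode, chrom, start, end in reads:
        by_key[(barcode, chrom)].append((start, end))

    molecules = []
    for (barcode, chrom), spans in by_key.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        n_reads = 1
        for start, end in spans[1:]:
            if start - cur_end > max_gap:
                # Too far away: close this molecule and start a new one
                molecules.append((barcode, chrom, cur_start, cur_end, n_reads))
                cur_start, cur_end, n_reads = start, end, 1
            else:
                cur_end = max(cur_end, end)
                n_reads += 1
        molecules.append((barcode, chrom, cur_start, cur_end, n_reads))
    return molecules


# Toy alignments (made up): BC1 has two nearby reads plus one far away
toy_reads = [
    ("BC1", "chr1", 1_000, 1_150),
    ("BC1", "chr1", 30_000, 30_150),
    ("BC1", "chr1", 500_000, 500_150),
    ("BC2", "chr1", 2_000, 2_150),
]
for barcode, chrom, start, end, n_reads in infer_molecules(toy_reads):
    print(barcode, chrom, end - start, "bp spanned by", n_reads, "read(s)")
```

The far-away BC1 read is split off as its own molecule, which is the same reasoning that separates two different molecules that happened to land on the same bead.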

GIAB has been aimed at comparing and calibrating methods.  On chromosome 3 (162,512,134-162,626,335) there is a structural variation region of NA12878 that has not been consistently called by different methods.  In the TELL-Seq paper, they note that one publication using 10X identified a 114 kilobase heterozygous deletion whereas the stLFR paper called it a homozygous 19 kilobase deletion.  The conclusion in the TELL-Seq paper is that both are present: a homozygous 19 kilobase deletion sits within the region carrying a 114 kilobase heterozygous deletion.  The 114 kilobase deletion is considered a gold standard >100kb deletion in NA12878 but is not always detected by array methods (see Table 1 of Zhou et al).

Another discrepant area of NA12878 is a deletion of chr5:104,431,113-104,503,673 -- the 10X paper calls it heterozygous and stLFR paper homozygous.  The TELL-Seq authors give qualified support for a heterozygous deletion but remark that coverage was low in this region. 
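
As a sanity check on the two regions quoted above, the coordinates can be turned into sizes with a few lines of Python.  The helper names are mine, and I report a plain end minus start; whether you add one depends on the coordinate convention, which I'm not assuming here.

```python
def parse_region(region: str):
    """Parse 'chr5:104,431,113-104,503,673' into (chrom, start, end)."""
    chrom, coords = region.split(":")
    start, end = (int(x.replace(",", "")) for x in coords.split("-"))
    return chrom, start, end


def region_span(region: str) -> int:
    """End minus start for a region string."""
    _, start, end = parse_region(region)
    return end - start


for region in ("chr3:162,512,134-162,626,335", "chr5:104,431,113-104,503,673"):
    print(region, region_span(region), "bp")
```

The chromosome 3 interval comes out at about 114 kilobases, matching the larger of the two deletion calls, and the chromosome 5 interval at about 73 kilobases.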

de novo assembly of the NA12878 data with SuperNova is claimed to be superior to assemblies generated with other linked read approaches.

Okay, I'm getting lost in some weeds & I'm not as facile looking at human genome data.  Something to work on.  But I think the overall conclusion is that for Illumina platforms TELL-Seq looks like an intriguing linked read option requiring no special hardware and seemingly automation-friendly.  Ideally someone out there would perform proper head-to-heads of TELL-Seq vs. Oxford Nanopore (both R9.4.1 and R9.5) and PacBio (both Continuous Long Read and Circular Consensus Sequencing) with the same input material (ideally for NA12878!) to truly see what the tradeoffs are.  I doubt we'll see such a 5-way comparison anytime soon, but if anyone could knock off one pairwise comparison that would really move the needle. 

UST is a company to watch because on top of the TELL-Seq product they are also making noises about developing their own long read sequencing platform.  In addition to the technical challenge of developing a new platform, there's the attentional challenge of running both an operating product line and the development of something radical.
