Monday, November 06, 2017

A Nucleotide Mixture-Based Error Correcting Short Read Chemistry

Sometimes polony-style short read sequencing seems like old news.  The underlying technology has been commercially available for over a decade, and I focus much of my attention on gains in long read technologies, though incremental improvements to read lengths or polony densities still appear.  Now in Nature Biotechnology a group from Peking University has published a new twist on sequencing-by-synthesis that is claimed to offer significant improvements in read accuracy.

The new paper describes a clever way of using multiple passes of cyclic sequencing-by-synthesis with unterminated nucleotides to improve error rates.  Ion Torrent and 454 are the two commercialized approaches that fit this classification.  But rather than each flow delivering a single pure nucleotide, the flows use pairs of nucleotides.  So, for example, the first round of sequencing might alternate R (A+G) and Y (C+T) flows.  The product is then chemically denatured from the template with formamide, a new primer annealed and a new round of flows performed.  But round two might alternate S (C+G) and W (A+T) flows.  After another denaturation and a new primer annealing, a third round of flows with K (G+T) and M (A+C) mixtures would be used.
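To make the flow scheme concrete, here's a small sketch (my own illustration, not the authors' code) of how a single template collapses into the run lengths each of the three rounds would report:

```python
# Hypothetical helper: collapse a template into the degenerate-polymer-length
# (DPL) runs that alternating dibase flows would read out in each round.
PARTITIONS = {
    "R/Y": ({"A", "G"}, {"C", "T"}),   # purine / pyrimidine
    "S/W": ({"C", "G"}, {"A", "T"}),   # strong / weak
    "K/M": ({"G", "T"}, {"A", "C"}),   # keto / amino
}

def flow_runs(seq, partition):
    """Return (mix, run_length) pairs: '0' for the first mix of the
    pair, '1' for the second -- the signal alternating flows would see."""
    mix0, _ = partition
    runs = []
    i = 0
    while i < len(seq):
        in_mix0 = seq[i] in mix0
        j = i
        while j < len(seq) and (seq[j] in mix0) == in_mix0:
            j += 1
        runs.append(("0" if in_mix0 else "1", j - i))
        i = j
    return runs

for name, part in PARTITIONS.items():
    print(name, flow_runs("ACGT", part))
```

Note how the same template yields three different run-length encodings -- the redundancy that the error correction scheme later exploits.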

Now, the paper posits these mixtures could be run using electronic (Ion Torrent) or pyrosequencing (454) detection, but for their proof-of-concept the authors used a monochrome fluorescent chemistry.  A new fluorophore linked to the nucleotides by a tetraphosphate is used.  After release by polymerization, the fluorophore does not fluoresce until the phosphate is clipped from it by alkaline phosphatase.  This yields a bright, stable fluorescence pattern to be imaged.

I'll confess that when I first read the paper I was confused on one point: how do you avoid disturbing the pattern while flowing reagent over the polonies?  The solution is clear from the rather encyclopedic supplemental material.  The polymerase used, Bst, is essentially inactive at 4C, so the flowcell is chilled prior to reagent delivery and then warmed after reagent addition to trigger polymerase activity.  The flowcell is then cooled to 15C prior to imaging the stable fluorescent pattern.  After imaging, the flowcell is chilled again and the next round of reagent is pushed in.  No specific wash step is specified.

Now, with mixtures the problem of homopolymers is amplified -- instead the problem is now what the authors call degenerate polymer length (DPL) arrays.  With the stable fluorescence pattern the authors show linearity of the signal for DPL arrays of up to eight.  The authors also show that only a small fraction (~1.15%) of the human genome is in arrays longer than eight, so this should be sufficient.  In their tests using lambda DNA and a laboratory setup (and a reused Illumina flowcell!), the raw error rate was estimated to be 0.18% in the first 100 nucleotides and 0.55% in the first 200 nucleotides.

Why use three rounds of degenerate mixes?  The advantage is the opportunity for error correction.  Two such rounds give sufficient information to decode the sequence, so the third round provides the opportunity to enable error correction.  The calls from each flow are converted to a bit string, with one mix as 0 and the other as 1 (and the intensity converted to the length of a run of the appropriate binary digit).  By choosing an appropriate mapping of mixes to bits, an interesting property emerges.  For each position, if the number of 1s is odd then that is a potentially correct set of calls, but if the number of 1s is even then that is definitely an error.  A parity check.  This corresponds to a 3-input logical XOR, which I hadn't bumped into before -- but it's simply two successive XOR operations -- XOR(a,b,c) = XOR(a,XOR(b,c)).  Or as one online reference pointed out, if you map bits to signs, then XOR is the equivalent of multiplying signs -- an odd number of minus signs comes out minus!
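The parity property can be checked mechanically.  Below is a sketch using one bit assignment with the stated property (R=1/Y=0, S=1/W=0, K=1/M=0 -- my choice for illustration; the paper's exact mapping may differ):

```python
# Each base contributes one bit per flow round: 1 if it falls in the
# first mix of the pair, under the assignment R=1/Y=0, S=1/W=0, K=1/M=0.
BITS = {
    #      R/Y, S/W, K/M
    "A": (1, 0, 0),   # A is in R, W, M
    "C": (0, 1, 0),   # C is in Y, S, M
    "G": (1, 1, 1),   # G is in R, S, K
    "T": (0, 0, 1),   # T is in Y, W, K
}

def parity_ok(bits):
    """Potentially correct iff the three bits have odd parity
    (3-input XOR equals 1); even parity flags a definite miscall."""
    a, b, c = bits
    return (a ^ b ^ c) == 1

# Every legal base passes the check...
assert all(parity_ok(BITS[b]) for b in "ACGT")

# ...and flipping any single bit (one miscalled flow) is always caught.
for b in "ACGT":
    for i in range(3):
        corrupted = list(BITS[b])
        corrupted[i] ^= 1
        assert not parity_ok(tuple(corrupted))
print("all single-flow errors detected")
```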

An additional layer of estimation is described as a potential future direction, though not demonstrated experimentally.  If each dibase mixture had its two nucleotides labeled with distinguishable fluorophores (dichromatic labeling), then additional information, in the form of the amount of extension by each nucleotide, would be obtained.

An interesting property of these dibase mixtures as opposed to typical single nucleotide addition schemes (such as 454 and Ion Torrent) is that there are no dark cycles.  For example, consider the sequences ACGT and CGTA with the flow order T,G,C,A: the former won't extend until the 4th flow and the latter extends only on the 3rd flow.  The other flows are dark.  But by alternating two-base mixtures, every flow is expected to generate signal on every template.  Hence, random sequence is expected to advance on average two bases per flow, whereas with conventional single-nucleotide flows the advance averages about 0.67 bases per flow.
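Those per-flow averages are easy to check by simulation -- a toy sketch of my own, not anything from the paper:

```python
import random

def advance_single(seq, flows, order="TGCA"):
    """Bases incorporated by cyclic single-nucleotide flows (454/Ion style)."""
    pos = 0
    for f in range(flows):
        base = order[f % len(order)]
        while pos < len(seq) and seq[pos] == base:
            pos += 1   # consume the full homopolymer run of the flowed base
    return pos

def advance_dibase(seq, flows):
    """Bases incorporated by alternating dibase flows, e.g. R then Y."""
    mixes = [{"A", "G"}, {"C", "T"}]
    pos = 0
    for f in range(flows):
        mix = mixes[f % 2]
        while pos < len(seq) and seq[pos] in mix:
            pos += 1   # consume the full run of bases in this mix
    return pos

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000))
flows = 1000
single_rate = advance_single(seq, flows) / flows
dibase_rate = advance_dibase(seq, flows) / flows
print(f"single-nucleotide: {single_rate:.2f} bases/flow")
print(f"dibase:            {dibase_rate:.2f} bases/flow")
```

On random sequence the single-nucleotide rate should land near 0.67 and the dibase rate near 2, matching the back-of-envelope figures above.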

With accurate DPL estimation and the identification of clear errors -- which imply that at least one flow's DPL wasn't correctly estimated -- it is now possible to estimate the locations of the errors using dynamic programming.  The question is where to insert gaps to realign the flows so that they are in agreement.  Instead of the typical sequence alignment scheme of gap open and gap extension penalties, the probability of each path is estimated using an application of Bayes theorem, with a prior of 1/(2^n) for a DPL of length n.  With this approach on the test lambda libraries the authors eliminated all errors in the first 200 nucleotides (sample size = 8,610 nucleotides), and error correction reduced errors in the first 250 nucleotides to 0.33%.
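As an illustration of how such a prior acts, here's a toy scoring function (my own sketch, not the authors' dynamic program) combining the 1/(2^n) prior with an assumed Gaussian signal likelihood:

```python
import math

def score(observed_signals, candidate_runs, sigma=0.3):
    """Log-score proportional to the posterior of a candidate set of DPL
    calls: geometric prior 2**-n per run, times an assumed Gaussian
    likelihood of the observed intensity around the candidate run length.
    sigma is a made-up noise level, purely for illustration."""
    logp = 0.0
    for s, n in zip(observed_signals, candidate_runs):
        logp += -n * math.log(2)                    # prior: P(DPL = n) = 2**-n
        logp += -((s - n) ** 2) / (2 * sigma ** 2)  # signal likelihood
    return logp

# For an ambiguous intensity exactly between runs of 2 and 3, the
# likelihoods tie and the geometric prior breaks toward the shorter run.
print(score([2.5], [2]), score([2.5], [3]))
```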

Given such claims, my mind immediately leaps to what might be pathological cases.  Clearly very long homopolymer runs are very bad, as the error correction properties will be blunted.  For example, imagine a run of 20 As.  That will show up as some signal -- perhaps badly estimated -- in R, W and M flows.

Perhaps more relevant are long dinucleotide repeats.  These create a very long DPL for one mix, but will be very short in the other flow rounds.  Truly long DPLs may be seen in some cases.  For example, there is a CA repeat in the EGFR gene which can have allele lengths in excess of 35 copies and may be relevant to tumor histology and patient survival.  One of those large arrays would generate a DPL of length 70 or greater in an M (A+C) flow.  Perhaps the biggest issue is that all of the other flow sequences will take 70 flows to cover that same distance, so in a 100 flow experiment only a small number of the templates will get through the entire array.  Of course, the same could be said for an Illumina read of only 100 bases; long arrays are just hard to sequence.

Simple repeats of longer unit length will cause similar problems.  Any unit which falls entirely within a single dibase mix will be particularly troublesome, such as AAT or CCGG repeats.  This also illustrates an issue with the authors' casual dismissal of only a bit over 1% of the genome consisting of DPLs longer than eight -- those DPLs may be enriched in regions of high biological or clinical interest.

Since this is an ensemble sequencing approach, dephasing of each polony is an issue.  In the Supplementary Material the authors develop a dephasing algorithm, which almost certainly deserves to be published as a separate paper.  The supplement even includes Matlab code for a "virtual sequencer" to simulate the dynamics of dephasing.  An interesting property of dephasing with dibase mixtures is that an inappropriate extension by a contaminating base can trigger further extension by the correct bases.  For example, consider an S (C+G) flow which contains minute amounts of A and T.  If a template reads CATCT, then essentially all molecules will incorporate a C.  A small fraction will then extend with a contaminating A -- but those won't extend further, because contaminating T is also rare.  On the other hand, a template reading CACCT would see further extension through the following Cs should a contaminating A be incorporated.  The authors call this the "One Pass, More Stop" principle and use it to estimate the distribution of DNA extension lengths and develop a dephasing algorithm.  Unfortunately, this section doesn't reference any prior work on dephasing algorithms, and that isn't an area I can comment on.  Again, this section deserves the fuller exposition and comparison to other literature which a separate paper would enable.
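The principle can be illustrated with a toy Monte Carlo (my own sketch, with a made-up contamination rate; the authors' virtual sequencer is far more detailed):

```python
import random

def s_flow_stops(template, contam=0.01, molecules=20_000):
    """Simulate one S (C+G) flow over many molecules of a polony, with
    trace A/T contamination incorporated at rate `contam` per stopping
    opportunity; return counts of where molecules end the flow."""
    stops = {}
    for _ in range(molecules):
        pos = 0
        while pos < len(template):
            if template[pos] in "CG":
                pos += 1                      # legitimate extension
            elif random.random() < contam:
                pos += 1                      # contaminating incorporation
            else:
                break                         # correct stop
        stops[pos] = stops.get(pos, 0) + 1
    return stops

random.seed(1)
# CATCT: an erroneous A is followed by T, which contamination rarely
# supplies, so the runaway halts.  CACCT: an erroneous A opens the door
# to racing through the following Cs ("One Pass, More Stop").
catct = s_flow_stops("CATCT")
cacct = s_flow_stops("CACCT")
print("CATCT stops:", catct)
print("CACCT stops:", cacct)
```

Nearly all molecules stop after the first C in both cases, but the rare runaways in CACCT jump three bases ahead of the ensemble rather than one.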

The one shocking gap in the paper, and clearly a failure on the part of both the authors and the reviewers, is the claim that this is the first sequencing-by-synthesis approach to use multiple reads of the template to feed an information-theory based error correction scheme.  The SOLiD Exact Call Chemistry (ECC) is just such an example, and it should have been cited and compared against the error-correcting ECC scheme (popular abbreviation!) in this paper.

The authors suggest that their error correction scheme could prove useful for detecting very rare variants, such as rare somatic mutations in cancer or rare fetal variants against the background of contaminating maternal DNA in a non-invasive prenatal testing setting.  In many of these settings, an effective mean read length of 200 bases (using 100 flows) should be quite adequate, given the frequently fragmented nature of FFPE or cell-free DNA.

As for speed of operation, Figure S4.2 suggests an overall cycle time around 2 minutes.  So 100 flows might be completed in less than 4 hours.  This suggests their approach could be a contender for applications requiring very rapid sequence acquisition, such as infectious disease diagnosis.  

Some of this will depend on what hardware this chemistry ends up running on.  It would appear that only a small amount of modification would be required to run this chemistry on any existing optical sequencing-by-synthesis apparatus.  The authors note it could be grafted onto Ion Torrent or 454, though the lower accuracy in DPL signal estimation would reduce the overall accuracy of this approach on such platforms. They curiously suggest the chemistry could work on PacBio, but that would require switching that platform from a continuous method (with all nucleotides present) to a cyclic chemistry. 

Is there a sustainable niche for a more accurate short read chemistry?  Given the dominance of Illumina, it might be hard to get a toehold.  Conversely, perhaps an existing second-line player would be interested.  It would be particularly interesting if this scheme worked with SeqLL's single molecule sequencing-by-synthesis instrument, improving the error rate of that platform and perhaps also improving its read lengths.  In any case, many more sequencing chemistries are published than commercialized, so we'll have to wait and see whether this one gets out of the gate.
