Monday, July 29, 2024

Musings on Possible Fixes To PacBio & ONT's Achilles Heels

I recently tried to stake a claim that I had first conceived Oxford Nanopore's "6b4" strategy for solving homopolymers, but that appropriately brought a number of citations for the concept that predated my blog piece.  Not one to give up easily (and as hinted in that piece), I'm going to spend part of this piece trying to stake a claim on some new concepts for fixing Oxford Nanopore's homopolymer issues - and PacBio's trouble with polypurine stretches.  To be honest, much of this piece will consist of me posing questions without having bothered to check whether they've already been answered in the literature.  Not only might someone chase those answers down, but it may well be that data already exists in the public sphere to explore proof-of-concept!  I haven't checked that either - doing so was on my list of "what to do if management gave me the summer off" - but they didn't.

Recapping The Problems

ONT has made remarkable strides in improving basecalling accuracy over the last decade, but homopolymeric stretches still bedevil them.  Homopolymers are simply too monotonous in signal to enable consistent measurement of the length of the homopolymer.  One approach to addressing this is to dope in a modified base, so that homopolymers are no longer stretches that are purely a single nucleotide; modified versions of the nucleotide will alter the signal.  6b4 accomplishes doping by resynthesizing DNA using a mixture that contains both adenine and an adenine analog, as well as thymine and a thymine analog.  At this time, it is poly-A/poly-T that ONT most worries about; a similar strategy could address poly-C/poly-G, but they aren't as concerned about those since they are less frequent in the human genome.  By adding some variety in bases to normally homopolymeric stretches, the monotony of the signal is broken up and better basecalling is enabled.

PacBio has difficulty with polypurine stretches; one theory is that the polymerase falls off of these.  As a result, variant calling in polypurine regions is poor due to low coverage.  I haven't seen anyone try a "6b4"-type strategy for these with PacBio.

Alternative Doping Strategy: EcoGII

I recently covered a range of related techniques for marking open chromatin so it can be read with long read sequencing.  My bibliography in that piece is now at least 1 or 2 references deficient due to new publications from the Stergachis lab.

Most of these methods rely on EcoGII, a methyltransferase that appears to methylate adenine in any context; it is available from NEB in their nifty Enzymes for Innovation program.

An interesting side question here is whether EcoGII shows any degree of cooperativity or anti-cooperativity -- if EcoGII methylates one A in a homopolymer, is it more likely, less likely or equally likely to methylate an adjacent A?  This is one of the questions that might be answerable with the existing long read chromatin profiling data, if any is available on SRA.
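For anyone who wants to poke at the cooperativity question, here is a minimal sketch of one way to score it.  It assumes you have already extracted per-read 6mA calls across a poly-A run as lists of 0/1 values (that input format, and the function name, are my own invention for illustration, not any tool's output).  It compares how often adjacent A's are both methylated with what independence would predict; a ratio above 1 hints at cooperativity, below 1 at anti-cooperativity.

```python
def adjacent_methylation_ratio(reads):
    """Observed / expected frequency of adjacent A's both being methylated.

    reads: list of lists of 0/1 calls, one list per read, each covering
    consecutive A positions within a single homopolymer (assumed format).
    Returns None when there is too little data to form a ratio.
    """
    total_sites = 0
    methylated_sites = 0
    total_pairs = 0
    both_methylated = 0
    for calls in reads:
        total_sites += len(calls)
        methylated_sites += sum(calls)
        # Walk neighbouring positions within the same read
        for left, right in zip(calls, calls[1:]):
            total_pairs += 1
            both_methylated += left & right
    if total_pairs == 0 or methylated_sites == 0:
        return None
    p = methylated_sites / total_sites      # overall per-site methylation rate
    observed = both_methylated / total_pairs
    expected = p * p                        # what independence would predict
    return observed / expected
```

This is deliberately crude: it pools every position, ignores miscalls by the modification caller, and gives no confidence interval, so treat it as a first look rather than a proper statistical test.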

We know that doping in some 6mA helps with ONT's ability to resolve homopolymers.  Of course, EcoGII would solve only homo-A and there aren't (yet) known enzymes covering the other three bases with a similar lack of context specificity.  But the advantage over 6b4 is a much, much simpler bit of biochemical workup - rather than running an entire polymerase reaction, just treat with a methyltransferase.  Of course, getting the right reaction conditions would require exploration - one wants to sprinkle methylation, not methylate every base.  Again, the data to start exploring this might already be in SRA!

Could it also help PacBio with G,A-rich regions?  I don't know - and I doubt anyone knows unless they've tried it.  But again, the data might exist in SRA!  One could look to see if G,A-rich regions which bear 6mA marks show higher coverage than those that do not.
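A rough sketch of how that comparison might start, with plenty of assumptions: the 20-base minimum run length is an arbitrary threshold I picked, the function names are hypothetical, and the per-region coverage values and 6mA flags are presumed to come from whatever alignment and modification-calling tools you prefer.

```python
import re
from statistics import mean

def polypurine_runs(seq, min_len=20):
    """Return (start, end) spans of uninterrupted A/G runs on the given strand.

    min_len is an arbitrary illustrative threshold, not an established cutoff.
    """
    pattern = re.compile(r"[AG]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]

def mean_coverage_by_mark(regions):
    """Compare coverage of 6mA-marked and unmarked regions.

    regions: iterable of (mean_coverage, has_6ma) tuples (assumed format).
    Returns (mean for marked, mean for unmarked); None where a group is empty.
    """
    regions = list(regions)
    marked = [cov for cov, has_6ma in regions if has_6ma]
    unmarked = [cov for cov, has_6ma in regions if not has_6ma]
    return (
        mean(marked) if marked else None,
        mean(unmarked) if unmarked else None,
    )
```

Note that this only scans the strand you hand it; a polypurine run on one strand is a polypyrimidine run on the other, so you would want to scan the reverse complement as well.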

Alternative Doping Strategy:  Intercalators

One of the other ways to mark open chromatin for sequencing readout is the intercalating natural product angelicin, which when exposed to appropriate light covalently binds to DNA.  This would have the potential advantage of solving all four types of homopolymers for ONT.  There's only one preprint with this method (which looked at the yeast Saccharomyces), but it does say the raw data is available - so it may be possible to compare basecall accuracy on marked and unmarked homopolymers.

There isn't any such paper with PacBio data, so no idea if this would improve polypurine performance - someone should perform the experiment!

Could more conventional intercalators have an effect?  Having a covalent linkage is probably very desirable for PacBio, so that each pass of the circular consensus sequencing encounters the same adduct.  Trying a bit of ethidium bromide or Sybr Green in a nanopore reaction is certainly within the reach of every lab with ONT hardware.  

It also raises some interesting questions for dyes that supposedly preferentially bind double-stranded vs single-stranded DNA.  If DNA is being unzipped by a helicase or polymerase, how do the kinetics of those enzymes compare to the kinetics of the intercalating agent leaving the DNA?

Get to Work!

So I've set out a bunch of questions and suggestions for data exploration - who will take up the task?  There are some questions that might be answered from the existing literature, some that might be asked of existing datasets - and most of all some new experiments that could be tried.  Even if these ideas don't pan out, just having datasets of raw ONT signal from libraries treated with different DNA-binding dyes might be inherently interesting and perhaps stimulate new research directions.




1 comment:

Anonymous said...

This is another one you're a bit late to the party on I'm afraid Keith. ONT looked at EcoGII many years ago as I believe Clive talked about it in one of his web talks.