Thursday, May 23, 2019

Nanopore's Long DNA Paradox

The first half day of London Calling has already delivered the usual mix of scientific excitement.  The prospect of saving lives, particularly those of children, en masse by delivering precision oncology broadly across sub-Saharan Africa.  Dizzying levels of alternative splicing in a key brain ion channel.  RNA modifications in great numbers.  That one was also gratifying, as it was commented that even without a basecaller, modified bases can be suggested by higher error rates around a particular motif.  Around 2015 or so I noted in a Nanopore Community post that error rates went up around GATC sites in E. coli, the target of Dam methylase.  Protein tags read by the current nanopore scheme.  Tonight we get Clive and company performing their usual razzle-dazzle of product announcements; one person pointed out to me a possible angle I failed to include in my laundry list: raising the speed limit from 450 to perhaps 1000 bases per second.  But another omission was front-and-center in the plant genomics sub-session I attended, and it could be called the central paradox of the current state of nanopore sequencing: pores are great for long DNA, but long DNA is not great for pores.
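
As a toy illustration of that error-rate trick -- not the analysis from the talk, nor my old Community post -- here is a sketch of how one might ask whether mismatch rates rise near a motif such as GATC.  It assumes pysam and numpy, a single-contig reference, and a coordinate-sorted, indexed BAM carrying MD tags (e.g. from minimap2 --MD); the file names, the motif, and the plus-or-minus 5 bp window are placeholders, not anything presented at the meeting.

#!/usr/bin/env python3
"""Toy sketch: is the mismatch rate elevated near a motif (e.g. GATC)?
Assumes an indexed, single-contig BAM with MD tags; names are placeholders."""
import re
import numpy as np
import pysam

BAM, FASTA, MOTIF, FLANK = "reads.bam", "ref.fa", "GATC", 5

ref = pysam.FastaFile(FASTA)
contig = ref.references[0]
seq = ref.fetch(contig).upper()
n = len(seq)

cov = np.zeros(n, dtype=np.int64)   # aligned bases per reference position
mis = np.zeros(n, dtype=np.int64)   # mismatched bases per reference position

with pysam.AlignmentFile(BAM, "rb") as bam:
    for read in bam.fetch(contig):
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        # with_seq=True returns the reference base, lowercased at mismatches (needs MD tag)
        for qpos, rpos, base in read.get_aligned_pairs(matches_only=True, with_seq=True):
            cov[rpos] += 1
            if base.islower():
                mis[rpos] += 1

# Mark every reference position within FLANK bp of a motif occurrence
near = np.zeros(n, dtype=bool)
for m in re.finditer(MOTIF, seq):
    near[max(0, m.start() - FLANK):min(n, m.end() + FLANK)] = True

def rate(mask):
    c, x = cov[mask].sum(), mis[mask].sum()
    return x / c if c else float("nan")

print(f"mismatch rate near {MOTIF}: {rate(near):.4f}")
print(f"mismatch rate elsewhere:   {rate(~near):.4f}")

A consistently higher rate in the first number than the second would be the sort of hint the speaker described, without ever asking a basecaller about modifications.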

Nanopore sequencing has opened up extreme new read lengths for DNA sequencing.  Matt Loose's group has seen reads as long as 2 megabases, which is rare, but it is pretty routine for folks to have stray reads in the multi-hundred kilobase range, and read N50s of 50-100 kilobases seem reachable by anyone patient with sample and library preparation.  We are spoiled!  It was five years ago this June that I saw my first full-length lambda read; just a solo read of 48 kb moved the field forward, and now we can get boatloads.
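
For readers less steeped in these statistics, the read N50 quoted above is the length L such that reads at least L long contain half of all sequenced bases.  A toy Python sketch, assuming a standard four-line-per-record FASTQ and with "reads.fastq" as a placeholder filename:

#!/usr/bin/env python3
"""Toy sketch of the read N50: the length L such that reads >= L long
hold at least half of all sequenced bases."""
import gzip

def read_lengths(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:          # FASTQ: the sequence is line 2 of each 4-line record
                yield len(line.strip())

def n50(lengths):
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print("read N50:", n50(read_lengths("reads.fastq")))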

But the catch with really long DNA is that flowcell yield suffers.  Now again, we are spoiled: those early flowcells were hard to get tens of megabases out of, and even a few years ago 5 gigabases was the province of only the most expert; now 20 gigabases is routine for well-behaved samples.  But other samples are not so polite, in particular samples with very long DNA.  It is especially a problem in the plant world, and one speaker in Q&A even commented that they aren't bothered by the newest BluePippin size selections losing the longest DNA, because that DNA does a lot of pore-killing.

Now, there is another trade-off to be made.  Many pores are not dead but simply choked with DNA.  Simple washing of the flowcell can fix some of these, and for the rest there is the drastic step of flushing the flowcell with nuclease.  Either of these requires more library -- which you must have -- and manual intervention, with both the time lost and the risk of introducing bubbles.  It is important to note here that 20 gigabases per flowcell of total data is often summed over multiple library loads with washing or nuclease flushing; getting 20 gigabases from a library simply left to run for 48 hours is relatively rare.  We should be grateful to have such problems, but choosing megabase reads over tens of gigabases of output is still a frustrating trade-off.

How does DNA choke a pore?  Why does ultra-long DNA seem to be worse?  These are mysteries.  For plants, there are strong suspicions that certain contaminants closely mimic DNA and co-purify with it.  For example, one speaker mentioned that treatment with pectinase improved some of his plant samples.  Mixing up purification methods appears to help: phenol-chloroform preps cleaned again with CTAB.  More steps mean more time and effort and a greater requirement for input material.

All three speakers in the plant session expressed a concern that efforts to sequence large numbers of plant species -- apparently there are multiple initiatives announced targeting thousands or tens of thousands of plants -- will be hampered by these sample preparation challenges.  One speaker offered mild optimism that each plant family may be its own DNA purification project, but also warned of a case in which a solution that worked on two ecotypes of a plant species failed on a third ecotype.

With no disrespect meant towards those who have been working on this, I wonder if the field would be accelerated substantially by some additional skillsets.  Nanopore sequencing is dominated by people with training in molecular biology or in specific biological domains.  The efforts to purify DNA could be characterized -- and again, this is not meant to be pejorative at all -- as tinkering.  Groups are intrepidly exploring hypotheses about the nature of the problems empirically -- will this protocol tweak help, or that enzyme, and so forth.  That has been a very useful approach, but I'm thinking of a very specific sort of reinforcements.

Analytical chemists are who I have in mind.  If we suspect specific moieties are problematic, it would be useful to know what those are.  Ultimately only the sequencing pore knows what the sequencing pore cares about, but getting better intelligence as to what is still in these preps would be useful.  So perhaps we need more folks who think in terms of LC-MS traces rather than nanopore squiggles to explore the composition of prepared DNA and provide a catalog of what contaminants are present, so we can design the next set of experiments.  Summon the ghost of Erwin Chargaff and measure things with high sensitivity and high precision.  Chargaff's observation of A:T and G:C ratios required stunning precision for the time.  Conversely, he would have absolutely hated the entire field of genomics; his autobiographical collection Heraclitean Fire is a must-read but also a window into a very dour character with a dim view of mass biology.

DNA integrity is also important -- nicks and breaks can be trouble.  There's also a hint that some structural motifs may be ruffians -- chicken DNA performs oddly badly in nanopore sequencing, and some think it might be due to certain motifs common in certain fractions of the DNA.  Schmaltz is schmutz to nanopores?  So perhaps we also need some experts in exploring the local structure of DNA, and those skilled in measuring the number of single-stranded breaks.  Or perhaps somehow convert nick levels into something readable on (cough) short read sequencers.

Day 2 awaits!  Time to get moving and get listening again.  

2 comments:

Anonymous said...

It throws open a good opportunity for someone to replace the likes of Qiagen, or for Qiagen to step up to the plate

Anonymous said...

The first thing I thought about when I read this post is the brouhaha around the "Grove's fallacy". At its core Andy Grove's argument is that biology could desperately use the rigor of semiconductor engineering. When an experiment fails in biology it is attributed with alarming frequency to bad reagents. However, efforts to find what is bad about the reagent are rarely undertaken. Such attitudes get you fired in semiconductor fabs. Many people interpreted Grove's argument to mean that a biological system can be designed like a semiconductor chip. I interpreted it to mean that you start by attempting to measure anything and everything about your system. And that starts with rigorous failure analysis. After all, if you can't measure it you can't control it (was it Jack Welch who said that?)