Friday, March 25, 2016

Selective sequencing: A Programming Opportunity!

I ask a bit of indulgence from my regular readership for this piece, as I am going to explain a number of things in depth that probably will be very familiar to them.  My hope, perhaps fantastic, is that this piece will get out to some who are not so familiar with such topics, as I think the problem at hand might be very fascinating.

Rapid advances in DNA sequencing technology in terms of capability, throughput and cost effectiveness have driven a wide array of changes in biology and medicine.  This is even starting to push into the public consciousness via mail-in genetic test kits or sophisticated at-home biology experiments.  As has been repeated ad nauseam, advances in sequencing capacity have meant that the cost per unit of sequence (nucleotide bases, aka bases) has been dropping faster than the pace of Moore's Law on the compute side.

The main driver of these advances is what is sometimes termed "massively parallel sequencing" (unfortunately more commonly called "next generation sequencing", which raises the question of what to call any later advance).  A typical setup is that the sequencer contains an array of DNA molecules, at locations which can be either specifically patterned or simply random with sufficient separation.  These spots may be occupied by truly a single DNA molecule, or by an ensemble of exact copies that serve to amplify the signal.

Most of the advances in sequencing capacity have come from increasing the number of molecules in a given run via increasing the density of spots in the array, to millions or even billions of spots per device.  Most of these technologies read the spots synchronously in a series of cycles, with each cycle reading one more base from each spot.  In the case of the ensemble methods, keeping all the molecules in lockstep proves impossible over large numbers of steps, so these methods are limited to reading hundreds of basepairs from each fragment.

Single Molecule Long Read Sequencing

A few newer technologies place individual DNA molecules at each location and sequence them asynchronously -- data simply keeps rolling in from each location.  The physics of these systems is quite mind-blowing.  For example, Pacific Biosciences' SMRT technology involves illuminating sub-wavelength nanomachined wells so that individual enzyme molecules can be spied on in real time.  Nanopore sequencing involves measuring current differences across a single hole (pore) in an insulating membrane as it is partially occluded by the subunits of a single DNA molecule passing through it.  Because these systems do not require keeping ensembles of molecules in lockstep, the length they can read from a single input molecule can be tens of thousands of bases or even over a hundred thousand bases.  Limits on read length are set by new issues, such as the challenge of keeping a huge, floppy DNA molecule from simply flopping and kinking itself into breaking.

A key characteristic of most of these technologies is that the set of DNA molecules to be sequenced is fixed at the start of the run.  DNA is in some way attached to a solid component of the sequencer, and that DNA will be at that location throughout the run.  So while you might be tempted to peek at results before the run is complete, and that could be useful for knowing the answer sooner, one cannot productively alter what will be sequenced.

There is now a single molecule sequencer family for which this is not the case: the Oxford Nanopore MinION and PromethION instruments (with MinION actually in the field and PromethION expected to begin shipping to beta sites imminently); the two differ only in scale.  These instruments have very small numbers of active sites -- at most 500 in the current incarnation of the MinION -- but the active pores on their chips have a duty cycle.  Correctly prepared DNA molecules are grabbed by a pore and run past the sensing element.  When the end of the DNA molecule is reached, the pore becomes available for a new DNA to dock and repeat the cycle.

In an imminent incarnation, DNA is read by the pores at nearly 300 bases per second; the future ceiling for this speed is claimed to be in excess of 1000 bases per second.  There's also a twist on this called 2D sequencing, in which some biochemical trickery has been used to connect the tail end of the strand first sequenced by the pore with the head end of the complementary strand.  That linkage is referred to as a hairpin (because if you draw it out, that's what it looks like).  The advantage is that by reading both strands of DNA and combining the information, a higher accuracy can be achieved.

Since the MinION is relatively inexpensive ($1000), the size of a USB device and requires little other support hardware, a very real possibility of taking DNA sequencing anywhere in the world is opened up.  This could be highly valuable for tracking emerging diseases such as Zika or Ebola, as well as useful for building lightweight scientific infrastructure in developing nations.  But the capacity of the MinION is small compared to many of the other devices, so how can we use it effectively?

Selective Sequencing

What makes this technology particularly unusual, and the focus of this article, is that a given pore can be instructed to nearly instantaneously eject a DNA molecule it is processing.  In a colorful description coined by others, the system can "taste" DNA and spit out those molecules it finds unpalatable.  The pore is then available to grab another molecule, taste it, and so forth.  This "read until" approach was recently demonstrated empirically by Matt Loose and colleagues.  A small catch: in 2D mode, spitting is only permitted before reaching the hairpin; once the pore has swallowed the hairpin it is committed to reading the rest of the molecule.

That opens up a whole world of possibilities -- but also a requirement for rapid computational decision making rarely seen in bioinformatics.  If the DNA is translocating at 280 bases per second, then a 2800 basepair fragment is only 10 seconds long! 

Barcodes, Amplicons and Fragments

What sort of things might we be tasting and why?  At a crude level, we can compress the world of possible DNAs into three categories: barcodes, amplicons and fragments.

Barcodes are human-designed and synthesized sequences used to tag different samples, so that multiple samples can be run simultaneously.  They are typically short (perhaps 10-30 bases), simply because of our limited ability to inexpensively make DNA.  However, they can be designed with purpose, and so there is a large (and still growing) body of literature on how to design barcodes to avoid confusing them.  Barcodes can also be applied multiply, either tagging different ends of a molecule or applying multiple barcodes in series.  For example, if we have our samples organized in the common format of plastic plates with 12 columns by 8 rows of wells, then one barcode might indicate the plate, another the column within a plate and a third the row.  In some cases, a set of barcodes is the entirety of the DNA to be read, but more commonly barcodes are attached to amplicons and fragments.
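To make that design criterion concrete, here is a minimal sketch (Python; the 8-base barcode set is invented purely for illustration) of scoring how distinguishable a barcode set is via its minimum pairwise Hamming distance:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes) -> int:
    """Smallest Hamming distance over all pairs; the larger it is, the more
    sequencing errors a barcode can absorb before being confused with another."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# A toy barcode set (made up for illustration):
codes = ["ACGTACGT", "TGCATGCA", "CCAATTGG", "GGTTAACC"]
print(min_pairwise_distance(codes))  # → 6
```

With a minimum distance of d, up to floor((d-1)/2) errors per barcode can in principle be corrected, which matters a great deal given nanopore error rates.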

Amplicons are DNA segments in which we know, at least to a close approximation, each end of the DNA and probably have a strong notion of what is in between those ends.  In some cases we might know it exactly, with the goal being to count the number of occurrences of that fragment in a sample, but more commonly we care about small differences within that fragment.  For example, the human gene KRAS is mutated in approximately 50% of human cancers and mutations in KRAS can help inform treatment.  The majority of KRAS mutations in cancer occur in 6 consecutive base positions, so using the technique of PCR we can generate amplicons that encompass those 6 positions.

Fragments are simply random fragments of DNA -- we extract DNA from some source and break it into pieces, or perhaps (for this piece) we treat as a fragment those cases in which a DNA (or RNA) is small enough that the intact molecule can be sequenced in its entirety.  

Balancing, Filtering and Needle Finding

Why might we want to utilize read-until?  The use space really runs from two extremes: balancing and needle-in-a-haystack.  I'll call the middle of that range filtering.

In balancing, we want to make sure we have sampled effectively from a set of known molecules.  For example, if we have a set of amplicons from multiple samples (indicated by barcodes), we might wish to ensure a uniform sampling of each of these, despite a lack of uniformity in the input.  

With filtering, we expect in advance that a small number of molecular species will be grossly over-represented.  For example, if measuring a bacterial cell's current genetic activity state, we want to look at RNA.  However, the vast majority of the RNA in the cell is ribosomal RNA, which isn't informative.  That ribosomal RNA may be 99% of the molecules we see, but is entirely chaff from our point of view.  Another example of filtering might be if we have cloned DNA within an E.coli host.  The cloned material might represent 1% or less of the total DNA, with the background E.coli effectively overburden we must sift through.
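If (and it's a big if, given the basecalling catch discussed below) basecalled sequence were available quickly enough, the simplest filtering sketch might be an exact k-mer screen against the unwanted reference.  Everything here -- the function names, the k=12 seed size, the 50% threshold -- is a hypothetical illustration, not the method Loose and colleagues used:

```python
def build_kmer_index(reference: str, k: int = 12) -> set:
    """Collect every k-mer of the unwanted reference (e.g. rRNA or host genome)."""
    return {reference[i:i + k] for i in range(len(reference) - k + 1)}

def should_eject(read_prefix: str, unwanted: set, k: int = 12, frac: float = 0.5) -> bool:
    """Eject if most k-mers in the read's prefix hit the unwanted index.
    Exact matching is optimistic given 65-85% accurate basecalls; a real
    filter would need error-tolerant seeds or signal-space matching."""
    kmers = [read_prefix[i:i + k] for i in range(len(read_prefix) - k + 1)]
    if not kmers:
        return False  # prefix shorter than k: not enough evidence to eject
    return sum(km in unwanted for km in kmers) / len(kmers) > frac

# Reads matching the unwanted reference get ejected; novel sequence is kept.
unwanted = build_kmer_index("ACGGTTCAGATTCCGGAAATTGCCGTACCAGTTGCAAGGTCCA")
print(should_eject("TCAGATTCCGGAAATTGCCGTACCAG", unwanted))  # → True
```

The appeal of a set lookup is that it is constant time per k-mer regardless of reference size; the catch, again, is that it presumes fast and reasonably accurate basecalls.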

At the extreme, we can imagine cases in which we are truly looking for a needle in a haystack.  For example, an exciting possibility for the rapid diagnosis of disease is to draw biological fluids from a patient (such as cerebrospinal fluid or pus) and sequence them.  However, most of the extracted DNA will be human.  The faster you can find the rare pathogen DNA in the sample, the sooner the patient can be put on the appropriate therapy.  In another application, if you are sequencing all the DNA found in a soil sample, you might keep track of what you've seen repeatedly, and try to sequence in the future only that which is novel.  

A key distinguishing characteristic of these three modes is the size of the universe that a read-until algorithm might search through.  With barcodes, the universe might be only 10^3 or 10^4 bases in total.  Amplicon sets might be more like 10^5 or even 10^6.  Bacterial genomes are in the 10^6 range, but the human genome is around 10^9.

The Catch

What I've left out so far is a serious catch.  DNA is beautifully digital in nature, most commonly just 4 simple letters.  Unfortunately, the nanopore devices don't easily generate such data -- at a molecular level the pore is interacting with multiple bases in a row.  Even if just 3 bases, 4^3 is a lot of electrical levels to correctly disambiguate.  So various algorithms, from Hidden Markov Models to Recurrent Neural Networks, are used to convert the electrical signal to bases, and even after that is done the basecalled sequence may have an accuracy in the 65-85% range (this in general gets better with improvements in the chemistry and basecallers, but error rates are still significant).  So, either one must wait for a slow basecaller to run (while the DNA continues to traverse the pore) or develop ways to ascertain whether a sequence is desired (or not) based on the original electrical signal.  As Doc Daneeka once said, that's some catch.

However, the prize is definitely worth the game.  Make the call quickly enough, and substantial savings should be possible.  For example, if the input DNA has a mean length of 15,000 bases (or about 100 seconds) and you can eject unwanted DNA after 10 seconds, that's quite a saving and should enable a far more sensitive needle-in-a-haystack search.
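A back-of-envelope model of that gain, in Python.  The 100-second read and 10-second decision window come from the example above; the 1-second pore reload time and the 1% abundance of wanted molecules are assumptions invented for illustration:

```python
def useful_fraction(wanted_s: float, unwanted_s: float,
                    reload_s: float, p_wanted: float) -> float:
    """Fraction of total pore time spent reading wanted molecules, given the
    seconds spent per wanted read, per unwanted read, and per pore reload."""
    per_molecule = (p_wanted * (wanted_s + reload_s)
                    + (1 - p_wanted) * (unwanted_s + reload_s))
    return p_wanted * wanted_s / per_molecule

# Without read-until, unwanted 100 s reads run to completion;
# with it, they are ejected after a 10 s decision window.
baseline = useful_fraction(100, 100, 1.0, 0.01)
read_until = useful_fraction(100, 10, 1.0, 0.01)
print(round(read_until / baseline, 1))  # → 8.5, i.e. roughly 8.5x enrichment
```

The exact multiplier obviously shifts with read length, decision time and target abundance, but the shape of the win is clear: the rarer the needle and the longer the reads, the bigger the payoff.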

[added after some Twitter feedback]
But you'll need the right algorithm, with execution speed a critical factor -- run too long and the opportunity to make a useful decision is lost.  Loose's group used Dynamic Time Warping (DTW) to compare the signal from the sequencer to simulated signals from reference sequences.  Unfortunately, as they note, DTW scales as the product of the lengths of the input and reference sequences, so DTW may be unsuitable for the needle-in-a-huge-haystack problems.  The human genome is about 100,000 times larger than the PCR amplicon set they were using, so scaling could be a real issue if you are trying to filter out human sequence.  Even the E.coli filtering problem represents a target set about 250-fold more complex.
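For readers who haven't met it, the core of DTW fits in a dozen lines; this toy pure-Python version (a sketch, not anything from the Loose implementation) makes the quadratic scaling visible:

```python
from math import inf

def dtw_distance(signal, reference):
    """Dynamic time warping distance between two 1-D series.
    The table is (len(signal)+1) x (len(reference)+1) cells -- cost grows as
    the product of the two lengths, which is exactly the scaling worry above."""
    n, m = len(signal), len(reference)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(signal[i - 1] - reference[j - 1])
            # extend the cheapest of the three neighboring partial alignments
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-stretched copy of a squiggle-like series still matches perfectly:
print(dtw_distance([1.0, 3.0, 2.0], [1.0, 3.0, 3.0, 2.0]))  # → 0.0
```

Real implementations band the table, normalize the signal and so on, but none of that changes the O(n·m) core that limits scaling to very large references.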

[expanded and edited]
Which brings up the issue of precision and recall; the nanopore data is certainly not error-free, with several error modes.  The appropriate balance between precision and recall will be application-specific, but in general I'd say that poor recall (false negatives) is going to be the more serious issue, particularly when looking for very rare events: if the event you are looking for is one in a million, you can't afford to call it wrong even once.  On the other hand, letting through extra chaff hurts one's efficiency, but doesn't compromise the overall goal -- as long as the filter is sufficiently stringent to make a difference.

If the speed of the nanopores continues to increase, then that could change when read-until actually makes sense.  For example, if the pores really do someday run at 500b/s, and most PCR amplicons are only 2500 bases or shorter (which is quite common), then the amplicons are only 5 seconds long -- not much time to make a decision.  Still, even at 1000 bases per second a 50 kilobase molecule would be nearly a minute long, and so could be useful to avoid if it is an uninteresting 50 kilobase molecule.

Come Plow Some New Ground!

If you're a programmer type, here is why I think you should get excited about this problem and start working on it.  First, advances in this space could have a sizable impact on point-of-care healthcare, which could change, for example, how we decide which antibiotics to use for infections (or whether to use them at all).  But the other side of the coin is that this is a nearly un-plowed field in bioinformatics, and we have far too many fields with absurdly deep furrows.

Take, for example, aligning the output from the market-dominant Illumina machines to a known genome.  Somewhere in the neighborhood of 100 different programs have been published for this task (another blogger ranted about this situation a few years ago, when there were only about 70).  Very few of these are in wide use; those that are earned their place either by being particularly good or by working well under some special condition (such as very low cost, low memory hardware).  Many were mostly demonstrations of this algorithmic idea or that piece of specialized hardware.  Many were developed simultaneously.  Some probably don't have any clear reason for existing.  Few are published with truly solid benchmarks; all too often the benchmarking is either absent or against some set of obsolete aligners of no interest to the community.

And that's just one problem.  We have fleets of de novo assemblers, and again there is a small set which have ever seen much use, and rarely does a new program truly improve on the best of the existing ones.  There are long queues of programs to do paired end merging, in which you have two sequences that overlap at their ends by a tiny amount; this is useful but incremental to other processes.  Hordes of tools to trim the ugly bits from the ends of reads.  And so on.

But here is a truly demanding problem in which there probably will be multiple optima.  Some will wish to do this in a well-established laboratory in the developed world, and so having a power-hungry bank of GPUs will be an option.  Others will want to be sequencing field samples by the light (and perhaps power) of a campfire, and will need solutions that use no more than a laptop, tablet or cell phone for compute.  

If you do decide to wade into the field, the other mistake I urge you to avoid is building your code from the ground up.  This is the other issue with all those similar programs: they're nearly all built on their own custom platforms.  It would be nice just to have consistent command line argument or configuration file conventions; better still if these tools actually built on each other.  But that is rarely the case.  If you like Python, there is Poretools, and if you like R there is poRe.  Java aficionados already have two packages to choose from (can one hold out hope for a future merger?), HPG Pore and npReader.  Alas, if you like Julia there isn't anything there, though the first user-created code ever released to the Oxford Nanopore community was written in Julia.

So, anybody want to dive into the brand new world of real time selective sequencing? 


Nick Loman said...

On the subject of poretools - we do provide an API that might be helpful for people writing code in Python. A unified interface across all tools is a nice idea, but may be difficult to achieve in practice.

Geneticist from the East said...

Thanks for another blog on this subject.

I have read the Loose preprint but I am still not quite sure how selective sequencing works.

Is the throughput of a MinION sequencer limited by a certain number of bases it can sequence? Suppose we have a MinION that can sequence at most 1000 bases and has only one pore. We also only consider 1D reads. There are seven reads of lengths 50,101,145,222,345,137,310 going through the single pore in this order. Then without selective sequencing, we should get them all except for the last read.

Suppose we now read the first 50bp to see if it contains a 30bp sequence we are interested in. If the first 50bp doesn't have it, then we reject the read. Assume only the 3rd,5th,7th pass the rejection test, then we would have sequenced 50,50,145,50,345,50,310 and the selective sequencing approach allows us to sequence the 7th read. Is this how it works?

Thanks a lot in advance.

Keith Robison said...

The MinION, as far as can be determined, has no limit on the read length. What selective sequencing / read-until is attempting to maximize is the useful sequence which is read.

It all gets back to this concept of duty cycle. A given pore grabs a DNA molecule, processes it, and then is available to process another DNA molecule. The overall throughput of the sequencer in a given time is the number of bases read -- but the useful throughput is the number of those bases read which matter for the experimenter. So if we can determine reliably that a molecule is not of interest, we can try to cut down the number of bases read which are not of interest. Selective sequencing is attempting to increase the yield (and yield per unit time) of interesting information by actively avoiding sequencing uninteresting DNA.

In selective sequencing, the goal is to determine quickly which molecules are uninteresting (either uninteresting overall, or simply redundant given what has already been read). By deciding early on to eject a strand rather than continuing to sequence it, the pore is available earlier to start reading another molecule -- we've sped up the duty cycle, at the cost of throwing away (not reading) data. But we determined that the data we are throwing away is of little/no interest, so that is a net gain.

Geneticist from the East said...

Thanks Keith for your clarification.

So the main advantage is to get the reads we are interested in as soon as possible.

As to the overall output from the sequencer by doing selective sequencing, I think we can expect a higher proportion of reads that we are interested in, by sacrificing some overall throughput wasted on rejected reads.

Does that sound right?

Keith Robison said...

Yes, I think you have it. You are right that there would be a small tax in overall performance, since one is forcing pores to go into the "waiting for new DNA" state more often.

Interesting that two open-source basecallers have appeared this week.

NanoCall is fully local and can apparently keep up with the current MinION (~70bps) with only 8 cores. Now, with the R9 pores running at close to 300bps, that may require correspondingly larger horsepower, though the model may be simpler since most of the signal is from 3 bases.

DeepNano relies on Metrichor for some critical parameters, but is very fast after that. It looks like NanoCall could generate those parameters, so perhaps the two can feed off each other -- DeepNano is claiming a higher accuracy on 1D than Metrichor achieves (again, using an R7 chemistry).

Matt Loose said...

Great post.

Just thought I'd chip in here on the comment that you have "some overall throughput wasted on rejected reads". First, the read lengths that you are quoting are not long enough to illustrate the benefit of read until. You need to think about much longer read lengths. In the example we use in the manuscript we focus on 2Kb amplicons. We'll stick with the 1D example you provide, so 2kb at 70 bases per second will take just over 28 seconds to sequence. If you are not interested in a specific sequence - say you have already seen enough copies of this specific sequence or it just isn't of interest to you - then without read until you will have to wait 28 seconds until you can sample another read. With read until as we have implemented it, you collect 3.5 seconds worth of data (approximately 250 bases), analyse it and choose if you want to sequence that read or not. That round trip might take a second, with a further 1 second 'downtime' as the pore reloads another molecule. Thus you would be free to move on to the next read in 5.5 seconds, rather than 28 seconds, saving 22.5 seconds and allowing you to use that time for another read. Of course, as you move to larger reads, the savings become substantial - if you can match a 20kb read in 5.5s you will save over 4 minutes in sequencing time by rejecting it if it is of no interest.

Thus with read until I would argue that you aren't wasting time on rejected reads, rather you are saving time on sequencing unwanted reads.

Now the figures I am quoting above are based on a perfectly tuned system, but ultimately read until should allow you to achieve a specific sequencing goal faster than you would otherwise. Exploring these dynamics is extremely interesting and worth thinking about in some detail.

There are many other methods that can be applied to squiggle matching - and they all have accuracy/speed trade-offs. I think that the new potential opened up by this type of approach requires careful thought to fully optimise and exploit maximally. As sequencing speed and throughput increases, the matching challenge becomes harder and in some respects the benefits may lessen (imagine the TBs of throughput suggested on the PromethION and consider the trade-off in compute power versus just sequencing some more). But you might be able to exclude 'contaminating' reads from your data set. Or exclude specific regions of DNA from analysis. Or enable the analysis of subsets of a patient's genome... Or choose to sequence potentially variant regions at much greater depth... Or dynamically consider haplotypes... My personal favourite is the concept of an interactive assembly - grow your assembly by choosing to sequence reads mapping to contigs and scaffolds as they 'grow'.