Rapid advances in DNA sequencing technology in terms of capability, throughput and cost effectiveness have driven a wide array of changes in biology and medicine. This is even starting to push into the public consciousness via mail-in genetic test kits or sophisticated at-home biology experiments. As has been repeated ad naseum, advances in sequencing capacity have meant that the cost per unit of sequence (nucleotide bases aka bases) has been dropping faster than the pace of Moore's Law on the compute side.
The main driver of these advances is what is sometimes termed "massively parallel sequencing" (unfortunately more commonly called "next generation sequencing", which begs the question of what to call any later advance). A typical setup is that the sequencer contains an array of DNA molecules, which can be either specifically patterned or simply random locations with sufficient separation. These spots may be occupied by truly a single DNA molecule, or an ensemble of exact copies that serve to amplify the signal.
Most of the advances in sequencing capacity have come from increasing the number of molecules in a given run via increasing the density of spots in the array, to millions or even billions of spots per device. Most of these technologies read the spots synchronously in a series of cycles, with each cycle reading one more base from each spot. In the case of the ensemble methods, keeping all the molecules in lockstep proves impossible over very long numbers of steps, so these methods are limited in the length of DNA they can read in each fragment to hundreds of basepairs.
Single Molecule Long Read Sequencing
A few newer technologies place individual DNA molecules at each location and sequence them asynchronously -- data simply keeps rolling in from each location. The physics of these systems are quite mind-blowing. For example, Pacific Biosciences SMRT technology involves illuminating sub-wavelength nanomachined wells so that individual enzyme molecules can be spied on in real time. Nanopore sequencing involves measuring current differences from a single hole (pore) in an insulating membrane as it is partially occluded by the molecule subunits of a single DNA molecule passing through it. Because these systems do not require keeping ensembles of molecules in lockstep, the length they can read from a single input molecule can be tens of thousands of bases or even over a hundred of thousand bases. Limits on reading are set by new issues, such as the challenge of keeping a huge, floppy DNA molecule from simply flopping and kinking itself into breaking.
A key characteristic of most of these technologies is the set of DNA molecules to be sequenced is fixed at the start of the run. DNA is in some way attached to a solid component of the sequencer, and that DNA will be at that location throughout the run. So while you might be tempted to peek at results before the run is complete, and that could be useful to knowing the answer sooner, one cannot productively alter what will be sequenced.
There is now a single molecule sequencer family for which this is not the case: the Oxford Nanopore MinION and PromethION instruments (with MinION actually in the field and PromethION expected to begin shipping to beta sites imminently); they differ only in scale. These instruments have very small numbers of active sites -- at most 500 in the current incarnation of MinION, but the active pores on their chips have a duty cycle. Correctly prepared DNA molecules are grabbed by a pore and run past the sensing element. When the end of the DNA molecule is reached, then the pore is now available for a new DNA to dock and repeat the cycle.
In an imminent incarnation, DNA is read by the pores at nearly 300 bases per second; the future ceiling for this speed is claimed to be in excess of 1000 bases per second. There's also a twist on this called 2D sequencing, in which some biochemical trickery has been used to connect the tail end of the strand first sequenced by the pore with the head end of the complementary strand. That linkage is referred to as a hairpin (because if you draw it out, that's what it looks like). The advantage is that by reading both strands of DNA and combining the information, a higher accuracy can be acheived.
Since the MinION is relatively inexpensive ($1000), the size of a USB device and requires little other support hardware, a very real possibility of taking DNA sequencing anywhere in the world is opened up. This could be highly valuable for tracking emerging diseases such as Zika or Ebola, as well as useful for building lightweight scientific infrastructure in developing nations. But, the capacity of MinION is small compared to many of the other devices, so how can we use it effectively?
What makes this technology particularly unusual, and the focus of this article, is that a given pore can be instructed to nearly instantaneously eject a DNA molecule it is processing. In a colorful description coined by others, the system can "taste" DNA and spit out those it finds unpalatable. The pore is now available to grab another module, and taste that molecule and so forth. This "read until" approach was recently demonstrated empirically by Matt Loose and colleagues. A small catch: in 2D mode, spitting is only permitted before reaching the hairpin; once the pore has swallowed the hairpin it is committed to reading the rest of the molecule.
That opens up a whole world of possibilities -- but also a requirement for rapid computational decision making rarely seen in bioinformatics. If the DNA is translocating at 280 bases per second, then a 2800 basepair fragment is only 10 seconds long!
Barcodes, Amplicons and Fragments
What sort of things might we be tasting and why? At a crude level, we can compress the world of possible DNAs into three categories: barcodes, amplicons and fragments.
Barcodes are human-designed and synthesized sequences used to tag different samples, so that multiple samples can be run simultaneously. Typically short (perhaps 10-30 bases), simply because our limited ability to inexpensively make DNA. However,these can be designed with purpose, and so there is a large (and still growing) body of literature on how to design barcodes to avoid confusing them. Barcodes can also be applied multiply, either tagging different ends of a molecule or applying multiple barcodes in series. For example, iIf we have our samples organized in the common format of plastic plates with 12 columns by 8 rows of wells, then one barcode might indicate the plate, another the column within a plate and a third the row. In some cases, a set of barcodes is the entirety of the DNA to be read, but more commonly barcodes are attached to amplicons and fragments.
Amplicons are DNA segments in which we know, at least to a close approximation, each end of the DNA and probably have a strong notion of what is in-between those ends. In some cases we might know it exactly, with the goal being to count the number of occurences of that fragment in a sample, but more commonly we care about small differences within that fragment. For example, the human gene KRAS is mutated in approximately 50 % of human cancers and mutations in KRAS can help inform treatment. The majority of KRAS mutations in cancer occur in 6 consecutive base positions, so using the technique of PCR we can generate amplicons that encompass those 6 positions.
Fragments are simply random fragments of DNA -- we extract DNA from some source and break it into pieces, or perhaps (for this piece) we treat as a fragment those cases in which a DNA (or RNA) is small enough that the intact molecule can be sequenced in its entirety.
Balancing, Filtering and Needle Finding
Why might we want to utilize read-until? The use space really runs from two extremes: balancing and needle-in-a-haystack. I'll call the middle of that range filtering.
In balancing, we want to make sure we have sampled effectively from a set of known molecules. For example, if we have a set of amplicons from multiple samples (indicated by barcodes), we might wish to ensure a uniform sampling of each of these, despite a lack of uniformity in the input.
With filtering, we expect in advance that a small number of molecular species will be grossly over-represented. For example, if measuring a bacterial cell's current genetic activity state, we want to look at RNA. However, the vast majority of the RNA in the cell is ribosomal RNA, which isn't informative. That ribosomal RNA may be 99% of the molecules we see, but is entirely chaff from our point of view. Another example of filtering might be if we have cloned DNA within an E.coli host. The cloned material might represent 1% or less of the total DNA, with the background E.coli effectively overburden we must sift through.
At the extreme, we can imagine cases in which we are truly looking for a needle in a haystack. For example, an exciting possibility for the rapid diagnosis of disease is to draw biological fluids from a patient (such as cerebrospinal fluid or pus) and sequence them. However, most of the extracted DNA will be human. The faster you can find the rare pathogen DNA in the sample, the sooner the patient can be put on the appropriate therapy. In another application, if you are sequencing all the DNA found in a soil sample, you might keep track of what you've seen repeatedly, and try to sequence in the future only that which is novel.
A key distinguishing characteristic of these three modes is the size of the universe that a read-until algorithm might search through. With barcodes, the universe might be only 10^3 or 10^4 bases in total. Amplicon sets might be more like 10^5 or even 10^6. Bacterial genomes are in the 10^6 range, but humans are 10^9
What I've left out so far is a serious catch. DNA is beautifully digital in nature, most commonly just 4 simple letters. Unfortunately, the nanopore devices don't easily generate such data -- at a molecular level the pore is interacting with multiple bases in a row. Even if just 3 bases, 4^3 is a lot of electrical levels to correctly disambiguate. So various algorithms, from Hidden Markov Models to Recurrent Neural Networks are used to convert the electrical signal to bases, and even after that is done the basecalled sequence may have an accuracy in the 65-85% range (this in general gets better with improvements in the chemistry and basecalled, but error rates are still significant). So, either one must wait for a slow basecaller to run (while the DNA continues to traverse to pore) or develop ways to a ascertain whether a sequence is desired (or not) based on the original electrical signal. For example, As Doc Daneeka once said, that's some catch.
However, the prize is definitely worth the game. Make the call quickly enough, and substantial savings should be possible. For example, if the input DNA has a mean length of 15,000 bases (or about 100 seconds) and you can eject unwanted DNA after 10 seconds, that's quite a save and should enable a far more sensitive needle-in-a-haystack search.
[added after some Twitter feedback]
But you'll need the right algorithm, with execution speed a critical factor -- run too long and the opportunity to make a useful decision is lost. Loose's group used Dynamic Time Warping (DTW) to compare the signal from the sequencer to simulated signals from reference sequences. Unfortunately, as they note, DTW scales as the product of the lengths of the input and reference sequence, so DTW may be unsuitable for the needle-in-a-huge-haystack problems. The human genome is about 100,000 times larger than the PCR amplicon set they were using, so scaling could be a real issue if you are trying to filter out human sequence. Even the E.coli filtering problem represents about 250-fold more complex a target set.
[expanded and edited]
Which brings up the issue of precision and recall; the nanopore data is certainly not error-free, with several error modes. The appropriate balance between precision and recall will be application-specific, but in general I'd say that poor recall (false negatives) are going to be the more serious issue, particularly when looking for very rare events: If the event you are looking for is one in a million, you can't afford to call it wrong once. On the other hand, letting through extra chaff hurts ones efficiency, but doesn't compromise the overall goal -- as long as the filter is sufficiently stringent to make a difference.
If the speed of the nanopores continues to increase, then that could change when read-until actually makes sense. For example, if the pores really do someday run at 500b/s, and most PCR amplicons are only 2500 bases or shorter (which is quite common), then the amplicons are only 5 seconds long -- not much time to make a decision. Still, even at 1000 bases per second a 50 kilobase molecule would be nearly a minute long, and so could be useful to avoid if it is an uninteresting 50 kilobase molecule.
Come Plow Some New Ground!
If you're a programmer type, here is why I think you should get excited about this problem and start working on it. First, advances in this space could have a sizable impact on point-of-care healthcare, which could change, for example, how we decide which antibiotics to use for infections (or whether to use them at all. But the other side of the coin is that this is a nearly un-plowed field in bioinformatics, and we have far too many with absurdly deep furrows.
Take, for example, if you wish to align the output from the market-dominant Illumina machines to a known genome. Somewhere in the neighborhood of 100 different programs have been published for this task (another blogger ranted about this situation a few years ago, when there were only about 70). Very few of these are in wide use, because they are particularly good or work well under some special condition (such as very low cost, low memory hardware). Many were mostly demonstrations of this algorithmic idea or that piece of specialized hardware. Many were developed simultaneously. Some probably don't have any clear reason for existing. Few are published with truly solid benchmarks; all too often the benchmarking is either absent or against some set of obsolete aligners of no interest to the community.
And that's just one problem. We have fleets of de novo assemblers, and again there is a small set which have ever seen much use and rarely does a new program truly improve on the best of the existing. There are long queues of programs to do paired end merging, in which you have two sequences that map overlap at their ends by a tiny amounts. This is useful but incremental to other processes. Hordes of tools to trim the ugly bits from the ends of reads. And so on.
But here is a truly demanding problem in which there probably will be multiple optima. Some will wish to do this in a well-established laboratory in the developed world, and so having a power-hungry bank of GPUs will be an option. Others will want to be sequencing field samples by the light (and perhaps power) of a campfire, and will need solutions that use no more than a laptop, tablet or cell phone for compute.
If you do decide to wade into the field, the other mistake I urge you to avoid is to build your code from the ground up. This is the other issue with all those similar programs: they're nearly all built on their own custom platforms. Nice would be to just have consistent command line argument or configuration file conventions; better would be if these tools actually improved each other. But that is rarely the case. If you like Python, there is Poretools, and if you like R there is poRe. Java aficianados already have two packages to choose from (can one can hold out hope for a future merger?), HPG Pore and npReader. Alas, if you like Julia there isn't anything there, though the first user-created code ever released to the Oxford Nanopore community was written in Julia.
So, anybody want to dive into the brand new world of real time selective sequencing?