Monday, October 17, 2016

Sequencing, not random probes, is the future of microbiological diagnostics

Okay, I'll admit I take requests.  Throw a topic at me by Twitter or email, and if it piques my interest and I feel I can say something intelligent, then I'll take it on -- but not necessarily instantly.  That's the genesis of today's item: a tweet from Kyle Serikawa asking whether a new paper in Science Advances from groups at Rice and Baylor College of Medicine on a proposed microbial diagnostic method (a paper highlighted by Eric Topol) had any legs.

I'll fully confess that I realized upfront that this paper cuts across my usual biases, so I resolved to try to read it with that in mind.  That's not really an excuse for the delay; I did read the paper right away, but the usual conflicts stretched out writing it up.  The paper, in brief, proposes a non-sequencing approach to an area which I had previously thought was going to be totally dominated by sequencing, and I inherently like sequencing.  After reading the paper, I agree with Kyle: I'm not convinced that this method has legs.  It doesn't help that the paper has some serious issues with describing its methodology, as well as utterly failing to make its computational methods available.
The underlying idea behind the paper is elegant: can a small number of essentially universal probes be used to identify one or more bacteria in a sample?  (The paper focuses on bacteria, but the methodology applies to any nucleic acid of interest.)  The usual non-sequencing approach is to design assays, such as probes or PCR amplicons, against specific targets and then attempt to multiplex these into a single assay scoring all the organisms.  Back before the turn of the century, one of my Millennium colleagues sketched out such an idea using microarrays.  If you wish to detect more species, the number of probes or amplicons must be increased.  In contrast, this paper takes a different tack: create a small number of short and/or sloppy probes (in this case, molecular beacons) which will individually hybridize to many genomes of potential interest.  Given a set of known sequences and appropriate estimates of the noise characteristics of the probe methodology, a given probe signature can be used to estimate the organisms in the sample.  The presence, but critically not the identity, of novel organisms can be inferred from signatures that do not match any which can be computed from the set of reference organisms.
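To make the signature idea concrete, here is a minimal sketch of how I imagine the matching step might work -- this is my own reconstruction, not the authors' algorithm (they released no code), and the mismatch tolerance, distance metric, and novelty threshold are all invented for illustration:

```python
# Sketch: classify an observed probe-signal vector against reference
# signatures computed from known genomes. All parameters are hypothetical.
import numpy as np

def count_hits(probe, genome, max_mismatches=2):
    """Count approximate occurrences of a short, sloppy probe in a genome."""
    k = len(probe)
    hits = 0
    for i in range(len(genome) - k + 1):
        mismatches = sum(a != b for a, b in zip(probe, genome[i:i + k]))
        if mismatches <= max_mismatches:
            hits += 1
    return hits

def signature(probes, genome):
    """Expected signal vector: one normalized hit count per probe."""
    sig = np.array([count_hits(p, genome) for p in probes], dtype=float)
    norm = np.linalg.norm(sig)
    return sig / norm if norm else sig

def classify(observed, ref_signatures, novel_threshold=0.3):
    """Nearest reference wins; a large distance implies a novel organism
    (present but, critically, not identifiable)."""
    observed = observed / (np.linalg.norm(observed) or 1.0)
    distances = {name: np.linalg.norm(observed - sig)
                 for name, sig in ref_signatures.items()}
    best = min(distances, key=distances.get)
    return best if distances[best] <= novel_threshold else "novel/unclassified"
```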

The paper performs many of the expected sorts of validation tests.  For example, using five random probes, nine different individual organisms can be delineated from each other.  In a simulation, fifteen probes were sufficient to distinguish forty different pathogenic bacteria from a CDC database.  Further simulations suggested that, with sufficient numbers of probes, even complex mixtures of bacteria can be deconvolved.  I would have included this last figure here, but it's in the PDF supplement, which is extremely unfortunate since it supports quite a strong claim.
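To spell out what such a mixture deconvolution might look like computationally: under a linear-signal assumption, the observed signal is the signature matrix times an abundance vector, and non-negative least squares can invert it.  This is a stand-in sketch under that assumption, not the paper's actual method (which, again, was not released):

```python
# Sketch: deconvolve a mixed sample assuming observed = A @ abundances,
# where A holds one per-organism probe signature per column.
import numpy as np
from scipy.optimize import nnls

def deconvolve(A, observed):
    """A: (probes x organisms) signature matrix; observed: probe signals.
    Returns estimated non-negative abundances and the residual norm."""
    abundances, residual = nnls(A, observed)
    return abundances, residual

# Toy example: 5 probes, 3 reference organisms, true mixture 70/30/0.
rng = np.random.default_rng(0)
A = rng.random((5, 3))
true_x = np.array([0.7, 0.3, 0.0])
observed = A @ true_x + rng.normal(0, 0.01, size=5)  # add measurement noise
est, resid = deconvolve(A, observed)
print(est, resid)  # estimates should land close to [0.7, 0.3, 0.0]
```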

The paper uses molecular beacons as the sensors: hairpin oligonucleotides with only 4 base stems and 34 base loops.  Hybridization to the loop disrupts the stem structure, separating the FRET pair on the ends of the oligo, which in turn generates a signal.  However, the authors also claim that a large number of other methods could be used, including qPCR, microarrays, or sequencing.  One example I'd throw out: this might be a particularly apt method to format onto Oxford Nanopore's proposed Cas9 tag method, as the challenge of getting absolute specificity with Cas9 becomes a feature, not a bug, in this methodology (so long as the sloppiness can be predicted).

So what are my problems with the paper and the method?  I've mentioned already that I think it was a mistake to put the mixture simulation in the supplement.  More seriously, too many of the claims in the paper are based on simulations.  Simulations are useful, but they are no substitute for actual experiments, particularly when making large claims.

If I had reviewed the paper, another complaint I would have made concerns the documentation of methods: it is weak in some places and worse in others.  The authors describe a number of computational steps.  There is a method for selecting random probes given a reference genome database, which they call GPS.  There is the method for taking the probe signals and a database and computing the organisms within the sample.  Code is provided for none of them.  I read through the methods section for the experimental beacon detection multiple times, and still came away unsatisfied.  The first part of the description makes it sound like the multiple probes were all in the same tube, but the later parts suggest each probe was in a separate tube.  The detection methodology relied on a Cy3/Cy5 ratio measurement, which I think was being used to detect a single probe but clearly could detect at most two.
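Since no code accompanies the paper, here is only my guess at the flavor of a probe-selection step like GPS: draw random candidate probes, then greedily keep the ones that best separate the reference signatures.  This is purely illustrative (and slow), and every name and parameter is hypothetical:

```python
# Sketch of random-probe selection against a reference genome set.
import itertools
import numpy as np

def hits(probe, genome, max_mm=2):
    """Approximate-match count of a sloppy probe against one genome."""
    k = len(probe)
    return sum(sum(a != b for a, b in zip(probe, genome[i:i + k])) <= max_mm
               for i in range(len(genome) - k + 1))

def separability(probes, genomes):
    """Smallest pairwise distance between reference signature vectors."""
    sigs = [np.array([hits(p, g) for p in probes], dtype=float)
            for g in genomes]
    return min(np.linalg.norm(a - b)
               for a, b in itertools.combinations(sigs, 2))

def select_probes(genomes, n_probes=5, n_candidates=200, probe_len=34, seed=0):
    """Greedy: repeatedly add the candidate that most improves separability."""
    rng = np.random.default_rng(seed)
    candidates = ["".join(rng.choice(list("ACGT"), size=probe_len))
                  for _ in range(n_candidates)]
    chosen = []
    for _ in range(n_probes):
        pool = [c for c in candidates if c not in chosen]
        chosen.append(max(pool,
                          key=lambda c: separability(chosen + [c], genomes)))
    return chosen
```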

The mixture studies are a start, but seem very preliminary at best.  For example, in a clinical sample the most serious background DNA may well be the patient's own.  That makes the human genome, at roughly 1,000 times the complexity of a typical bacterial pathogen, perhaps the most important potential confounder of this methodology.  Human contamination might range from minimal to extreme; how would this affect the sensitivity and specificity of the system?
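As a back-of-envelope illustration of why host DNA matters (my own arithmetic, not the paper's; the cell ratios are made up purely to show the scale of the problem):

```python
# ~3.2 Gb human genome vs ~5 Mb bacterial genome is where the ~1,000x
# complexity ratio comes from; per-cell DNA mass follows genome size.
HUMAN_GENOME_BP = 3.2e9
BACTERIAL_GENOME_BP = 5.0e6

def pathogen_dna_fraction(bacterial_cells, human_cells):
    pathogen = bacterial_cells * BACTERIAL_GENOME_BP
    host = human_cells * HUMAN_GENOME_BP
    return pathogen / (pathogen + host)

# Even at 100 bacteria per human cell, host DNA still dominates the sample:
for ratio in (1, 10, 100, 1000):
    print(f"{ratio} bacteria per human cell -> "
          f"{pathogen_dna_fraction(ratio, 1):.1%} pathogen DNA")
```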

But the more serious issue is how this would stack up against sequencing.  The biggest drawback of the method that I would imagine (I don't claim to be an infectious disease specialist, though I am currently working on antibiotic discovery) is the lack of fine-grained resolution.  The authors show the method distinguishing different strains of E.coli, some labeled pathogenic or non-pathogenic.  But given that so many clinically-relevant features can be at the level of a single plasmid, will this method detect them?  Could it distinguish anthrax from its almost ubiquitous, but harmless, Bacillus cousins?  Conversely, will the method, with real world samples, constantly report the presence of unclassifiable portions of the signal?

The only way to compare the method head-to-head with sequencing is to pick a specific protocol on each side, so I'll work with the molecular-beacon protocol described in the paper.  Both approaches start with sample extraction (here an overnight proteinase K digestion followed by phenol/chloroform extraction; possibly a dirtier method would suffice).  Suppose we perform a gedanken experiment in which the latest Oxford Nanopore MinION is raced against the universal beacon method.  Let's make the further assumption that the MinION can be primed during the DNA preparation step.

For MinION, I'm assuming the new 5 minute transposase prep generating 1D reads is sufficient; this is plausible but hardly proven.  If 2D reads are required, then that adds about another 80 minutes before the MinION starts generating reads (but it makes the time required for priming a moot point).  The newest MinION kits are generating up to 10Gb in 48 hours -- at least in ONT's hands (yield drop-off in the field has been a persistent issue with MinION).


The molecular beacon approach requires no library preparation -- just mix the probes with the DNA sample and run a simple 10 minute thermocycle profile -- but it is followed by overnight hybridization.  Overnight is always a rather imprecise timing, but let's call that 8 hours.  In that time the MinION should generate on the order of 1.5-4Gb of 1D data.  That's far more data than previous nanopore publications could get from a single flowcell in 48 hours, let alone in 8.  For genus-level and species-level identification, even with background DNA, this will likely be sufficient.  Detection of plasmids should also be feasible.  Most importantly, anything unusual in the sample will be detected at the sequence level, though such sensitivity again raises the issue of making sense of a flood of unexpected sequences coming from samples.  The MinION approach would also report data continuously, with the possibility of finding an actionable result at almost any time and of going for higher sensitivity by running longer; probe reporting is a single event after the hybridization.
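For concreteness, here is the yield arithmetic behind those numbers, assuming linear scaling from ONT's best-case 48-hour figure (a generous assumption, given field yields) and an illustrative ~5Mb pathogen genome:

```python
# Rough yield-to-coverage arithmetic for the gedanken race; the 10 Gb/48 h
# figure is ONT's best case, and the pathogen fractions are illustrative.
BEST_CASE_GB_PER_48H = 10.0
HOURS = 8                 # the overnight hybridization window
GENOME_MB = 5.0           # illustrative bacterial pathogen genome size

yield_gb = BEST_CASE_GB_PER_48H * HOURS / 48   # ~1.7 Gb best case in 8 h
for pathogen_fraction in (1.0, 0.1, 0.01):     # share of non-host DNA
    coverage = yield_gb * 1e3 * pathogen_fraction / GENOME_MB
    print(f"{pathogen_fraction:.0%} pathogen DNA -> ~{coverage:.0f}x coverage")
```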

The cost of such a platform, both upfront and per sample, would be an important consideration if it were really to be placed in every medical office and hospital.  However, this is much harder to tackle, given the nebulous nature of the beacon approach.  Perhaps a small handheld device could be created, along the lines of portable qPCR devices, to perform all of the post-PCR steps, and presumably it could be manufactured inexpensively.  Reagent costs might be quite low, so that even hundreds of probes could be read out for a hundred dollars or so, though that will depend on the exact nature of the detection methodology (for example, if each probe requires its own reaction chamber, then the required optics become much more complex).  Based on current MinION pricing, the device is $1K (but that comes with an initial flowcell), the flowcells might be $400-$500 in bulk, and the library prep is on the order of $50-$100, so the MinION approach is perhaps more expensive per sample.  There's also a bit more skill involved on the MinION side: the new 5-minute protocol hasn't been released, but it presumably has a few pipetting steps plus priming the MinION, whereas the beacon protocol appears to be single tube.
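Putting rough numbers on that comparison, using only the guesses above (none of these figures are official pricing, and the beacon-side number is pure speculation):

```python
# Back-of-envelope per-sample cost comparison; all figures are guesses.
minion_flowcell = (400, 500)     # bulk pricing guess, one flowcell per sample
minion_library_prep = (50, 100)  # per-sample prep guess
beacon_readout = (50, 150)       # "a hundred dollars or so"

minion_low = minion_flowcell[0] + minion_library_prep[0]
minion_high = minion_flowcell[1] + minion_library_prep[1]
print(f"MinION:  ${minion_low}-${minion_high} per sample")   # ~$450-$600
print(f"Beacons: ${beacon_readout[0]}-${beacon_readout[1]} per sample")
```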

So what's my final verdict?  I'm going to go the same direction I expected to go: with sequencing.  For a significant, but not extravagant, increase in cost, the time-to-actionable-result is much shorter and the sensitivity for small changes is greatly enhanced.  Perhaps the universal probe method in the paper can be reformatted onto a faster overall platform, but its expected lack of sensitivity for small but critical sequence changes would always be a concern, as would the problem of giving only a possible warning of a novel pathogen with no specific information about it.  In any case, the method needs to be proven out with a wide range of actual samples bearing appropriate human and other contamination at differing concentrations, not just simulations.  Perhaps the method could be useful in some specific situations, with some intense further development.  For example, if it could be formatted to work on extremely inexpensive equipment with extremely inexpensive consumables (say, put into a paper strip format and read with a cellphone camera), this methodology might be valuable in developing nations or other places where cost is an extreme constraint.  But for general use in developed nations, I think the advantages of rapid sequencing are great, and will only get better as that technology is developed aggressively for a wide range of uses.

P.S.  Thanks Kyle for pointing me at this -- it's been fun to read it and think about it.


1 comment:

Unknown said...

Hi Keith,

Thanks for doing such a thorough analysis--you caught a number of things that I missed when I went through the paper myself. My reaction was that, as you point out, sequencing on nanopore is still very much in the growth phase and likely to get quicker and cheaper than this method in the near future, if it isn't already. I also share your concerns about contamination effects. One of the recent pathogen ID nanopore papers I've read started with urinary tract infections, for what I assume is the basic reason that human DNA in urine is present at a much lower level (although still an issue) than it is in blood.

I also think this approach suffers for the same reason sequencing for discovery so rapidly overtook microarrays: when you use a fixed set of tags to query, you can only discover what you originally put into the assay. With sequencing, you can discover whatever is out there, with no preconceived notions. So much of our non-coding RNA knowledge exists only because we could sequence and discover, hey, there's a lot of other stuff being transcribed that doesn't correspond to our known gene models.

I do think it's a clever approach, and like you say, in some circumstances might be a useful alternative to sequencing. But probably only in a very targeted use case.

Kyle