Thursday, July 02, 2015

Leaky clinical metagenomics pipelines are a very serious issue

Update: Some significant issues with the tone of this post are discussed in a follow-up.

I am a firm believer that the practice of science is the result of contingency; we do not necessarily have the best scientific culture possible but rather one which has evolved over time driven by chance, necessity and human nature.  We should never hesitate to re-examine the way science is actually practiced, and that particularly holds true for how we analyze data and publish results.  A re-analysis of a prominent Lancet paper has just come out in F1000, and this work by Steven Salzberg and colleagues illustrates a number of significant issues that slipped past the conventional peer review publishing practice.

The 2015 paper in question (Greninger et al) was from a large group including Joe DeRisi, and was related to (but distinct from) the juvenile encephalitis case study that yielded one of my most read entries in this space.  The informatics methods used both in the case study and the paper in question were published as a separate paper in Genome Research, and it is this tool (SURPI) which has been called into serious question, with perhaps some very serious consequences.

The Greninger paper used shotgun metagenomic sequencing to strengthen a suggested link between enterovirus D68 and a polio-like syndrome called acute flaccid myelitis.  While Salzberg and colleagues agree (as noted in a correspondence reported by homolog.us) that many of the samples have enterovirus D68, their first criticism is that several samples present an alternative hypothesis of serious bacterial infections.  While this did not significantly affect the implication of enterovirus D68 in the disorder, since these same tools are being touted as capable of affecting treatment decisions this discrepancy is of significant concern.  Failing to detect bacterial infections that might be treatable with antibiotics would be a serious miss in a clinical setting, as would missing that one of these bugs is multi-resistant Staphylococcus aureus.  Now, there has been a suggestion on Twitter by Nick Loman that perhaps this diagnosis is a bit strong based on metagenomics, but raising this possibility would seem (to this non-clinician) to be important.

The other issue uncovered by Salzberg's group is also troubling. Metagenomic data generated from patient samples are invariably contaminated with large numbers of human reads. For fast decision making these are simply a burden which the pipeline must deal with.  But when depositing data in unrestricted public databases, the failure to fully screen out human sequences before release represents a serious breach of ethics and patient trust.

It is easy to say that the standard scientific process failed on multiple levels. Patients and their privacy are supposed to be protected by Institutional Review Boards (IRBs), but these serve primarily to review proposed research protocols.  It is exceedingly doubtful that any IRB has ever run a code review or tested software.  Peer reviewers of the paper might try to test out software, but that is rare (I'll confess to browsing datasets but I'm not sure I've ever tried to run software) and in this case that would mean the data breach had already occurred.

The SURPI paper had drawn some flak for claiming in the title to be "cloud-enabled", which smells strongly of buzzword mongering.  After all, unless you've coded your tool in assembly code for a Commodore 64 that was in your closet, it's cloud compatible.  Having used Amazon EC2 extensively I can speak from experience: cloud machines are no different from local machines, other than the fact you can rent huge numbers of them.  SURPI doesn't appear to even have any particularly cloud-friendly attributes, such as easily running jobs distributed across a large number of machines, which is the really great aspect of the cloud.  The SURPI paper measures the performance of their pipeline in their "benchmark" section, but does not compare it to any other pipeline.

A lack of robust benchmarking is a frequent theme in bioinformatics, and a simple-to-suggest (though complex-to-implement) solution would be yet another comparison contest or effort à la CASP, CAGI, GAGE, Assemblathon, etc.  Given how important it is to get these sorts of tools right, that would be an important step.  The nucleotid.es effort at continuous benchmarking of assemblers is perhaps an even better model, automating much of the drudgery and enabling on-going head-to-head comparisons.  But just as critical as the testing framework is making sure we ask the right questions (as opposed, say, to a thread in which someone was asking what N50 value indicated a high quality de novo transcriptome assembly).  Sensitivity for detecting pathogens is critical, but as this paper illustrates there is an importance to screening out human sequences that can go beyond speeding the pipeline.  Speed is important too, but must be traded off against other attributes.
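To make "asking the right questions" a bit more concrete, here is a minimal Python sketch of how a benchmark might score a pipeline on two axes at once: sensitivity for pathogen reads and the fraction of human reads that leak past the host filter. The function name `score_pipeline`, the label strings and the toy reads are all invented for illustration; this is not how SURPI or any existing contest scores results.

```python
def score_pipeline(truth, calls):
    """Return (pathogen_sensitivity, human_leak_rate) for one pipeline run.

    truth: dict mapping read id -> true origin ("human", "pathogen", "other")
    calls: dict mapping read id -> label the pipeline assigned to that read
    Both dictionaries and their labels are hypothetical, for illustration.
    """
    pathogen_ids = [r for r, origin in truth.items() if origin == "pathogen"]
    human_ids = [r for r, origin in truth.items() if origin == "human"]

    # Pathogen reads that the pipeline also labeled as pathogen.
    detected = sum(1 for r in pathogen_ids if calls.get(r) == "pathogen")
    # Human reads that the pipeline failed to label as human (they would be released).
    leaked = sum(1 for r in human_ids if calls.get(r) != "human")

    sensitivity = detected / len(pathogen_ids) if pathogen_ids else float("nan")
    leak_rate = leaked / len(human_ids) if human_ids else float("nan")
    return sensitivity, leak_rate


# Toy truth set and pipeline output, invented for this example.
truth = {
    "r1": "human", "r2": "human", "r3": "human", "r4": "human",
    "r5": "pathogen", "r6": "pathogen", "r7": "other",
}
calls = {
    "r1": "human", "r2": "human", "r3": "human", "r4": "other",
    "r5": "pathogen", "r6": "other", "r7": "other",
}

sensitivity, leak_rate = score_pipeline(truth, calls)
print(f"pathogen sensitivity: {sensitivity:.2f}, human leak rate: {leak_rate:.2f}")
```

Reporting the two numbers side by side makes the trade-off visible: a pipeline tuned only for speed or sensitivity could look fine on one axis while quietly failing the privacy one.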

Further complicating the problem is the incipient rise of true real-time sequencing approaches with analysis occurring in parallel with data collection and the potential to make decisions as soon as sufficient data has arrived to meet a defined threshold (and shut the run down).  Such data may require a whole new round of data simulators which can model the arrival of such data.  Complicating things further for benchmarkers is the potential to terminate sequencing of individual DNA molecules if the initial data from that molecule hits some pre-defined pattern ("read until" in Oxford Nanopore-speak).
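The threshold idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: reads arrive already classified, the stopping rule is a simple count of supporting reads, and the names `reads_until_call`, `min_supporting_reads` and the taxon labels are made up for the example. A real run would need a statistically justified threshold and a simulator that models realistic arrival of data.

```python
def reads_until_call(classified_reads, min_supporting_reads=3, ignore=("human",)):
    """Consume (read_id, taxon) pairs in arrival order.

    Returns (taxon, reads_seen) as soon as any non-ignored taxon has
    accumulated min_supporting_reads reads, or (None, reads_seen) if the
    stream ends first. The count-based rule is an illustrative assumption.
    """
    counts = {}
    seen = 0
    for _read_id, taxon in classified_reads:
        seen += 1
        if taxon in ignore:
            continue  # host reads never trigger a call
        counts[taxon] = counts.get(taxon, 0) + 1
        if counts[taxon] >= min_supporting_reads:
            return taxon, seen  # enough evidence: the run could be shut down here
    return None, seen


# Toy stream of classified reads, invented for this example.
stream = [
    ("r1", "human"), ("r2", "human"), ("r3", "enterovirus_D68"),
    ("r4", "human"), ("r5", "enterovirus_D68"), ("r6", "human"),
    ("r7", "enterovirus_D68"), ("r8", "human"), ("r9", "human"),
]

call, reads_seen = reads_until_call(stream)
print(call, reads_seen)
```

Because the function stops consuming the stream at the moment of the call, a benchmark built around it would measure not just whether the right answer is reached but how many reads it took to get there.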

Devising truth sets and evaluating results won't be simple either, and perhaps the hardest cases are some of the most important to solve.  For example, aligning to a standard human genome reference was a key part of the SURPI pipeline, but such a method would clearly leak through sequences belonging to genome segments missing from standard references -- and such rare sequences by their rarity would be expected to be highly identifying of a patient.  The new wave of graph-based aligners attempts to address the problem of private sequences, but again the use case we discuss here is an acid test for alignment false negatives.
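The leak is easy to reproduce on toy data. The Python sketch below flags a read as host when enough of its k-mers occur in a reference; this k-mer matching is a stand-in chosen for brevity, not the aligner-based filtering SURPI actually uses, and the sequences, `screen_reads` and its parameters are all invented for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, reference, k=8, min_fraction=0.5):
    """Split read names into (flagged_as_host, released).

    A read is flagged as host when at least min_fraction of its k-mers
    occur in the reference; every other read would be released.
    """
    ref_kmers = kmers(reference, k)
    flagged, released = [], []
    for name, seq in reads:
        read_kmers = kmers(seq, k)
        shared = len(read_kmers & ref_kmers)
        if read_kmers and shared / len(read_kmers) >= min_fraction:
            flagged.append(name)
        else:
            released.append(name)
    return flagged, released


# Toy data, invented for illustration only.
reference = "ACGTTGCAAGCTTAGGCTAACGGATCCGTATTGACCGTA"
reads = [
    ("host_in_reference", "GCAAGCTTAGGCTAACGGAT"),       # present in the reference
    ("host_private_insertion", "TTTTCCCCGGGGAAAATTCC"),  # human, but absent from the reference
    ("pathogen", "CATGCATGCCATGGTACGAT"),
]

flagged, released = screen_reads(reads, reference)
print("flagged as host:", flagged)
print("released:", released)
```

The read standing in for a private insertion is released alongside the pathogen read, because the filter can only recognize what the reference contains. A truth set for this use case would therefore need to include human sequence that is deliberately absent from the reference being screened against.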

Hard problems, but important ones.  Mistakes made early in a field are understandable and forgivable.  What isn't is if the field continues forward without incorporating the lessons from these mistakes.  Disease diagnosis and outbreak tracking from shotgun metagenomics holds great promise, but that can't come at the price of undermining patient trust that high standards of privacy protection are adhered to.


6 comments:

Keith Robison said...

It has been pointed out privately to me that the original paper did discuss the bacterial reads and ruled them out as a cause of the disease due to the common presence of these organisms in the upper respiratory tract (and the H. influenzae sample was not a case of paralysis). So the criticisms of Salzberg and company on that point may have been amiss.

Keith Robison said...

Also, the original authors have written a response to the F1000 criticism.

Steven Salzberg said...

We have posted a response to the response on F1000, though it hasn't appeared yet. (We posted it Friday 3 July.)

What the original paper had was a summary table (Suppl Table 3) listing the total numbers of bacterial reads in each sample. This did indeed indicate that they found bacteria in each of the two samples we focused on, but it didn't say what species. Then, in their Supplementary Table 5, they list about 75 different species, broken down by sample, showing how many reads from each species were in each sample. That table did not include any reads at all from either Haemophilus influenzae or Staphylococcus aureus. So the claims in our paper are correct: from a reading of Greninger et al's published paper, there is no report that these species were found.

The authors responded to me privately as well, and say they knew about the Haemophilus and Staph aureus, but didn't think it was significant. That may be true - it's impossible to verify - but then it is hard to explain why they include a table with 75 different species found in many other patients/samples and didn't include these.

We stand by our results and maintain that if the authors did find these bacteria and had ruled them out, they should have at least mentioned that in their paper. And I'd add that in their response, the authors agreed that our other main finding, the unintended deposition of human reads from these patients into a public archive, was correct.

Jonathan Jacobs said...

Keith - excellent points all around. The pre-NGS (wet lab dehosting) and post-NGS (in silico removal) filtering of human reads from metagenomics clinical micro Dx samples is something we are tackling as well. It's a tough challenge - especially if you consider that hg19 alone is not going to cut it. (Have you read J. Allen's paper on human <-> microbial genome cross contamination? It sort of scopes the problem pretty well.) In any case, for now the approach seems to be to just dump these reads. Sure, they still show up in the raw FASTQ from the sequencer, but presumably those don't leave the lab.

Also - regarding SURPI: I can tell you first hand that -- sensitivity aside -- if you were to run SURPI, GOTTCHA, KRAKEN side by side... you would likely get different "diagnostic calls". The problem with incongruence between algorithms gets worse if you start looking at trace detection as well (where your human gDNA dehosting didn't work so well; you are often 10000:1 or worse with human:pathogen reads). Lastly - metagenomics pipelines ≠ diagnostic pathogen ID pipelines. Microbial community profiling (in my mind) has a different set of requirements than a pathogen Dx, but there is an undertone in the community that existing metagenomics tools are good enough for pathogen Dx. They are not - for many reasons. First, a clinician is likely to care only about a specific set of pathogens, and the presence/absence of phenotypic predictors associated with those pathogens; not the entire milieu of microorganisms in a sample. Second, the reporting of metagenomics data can easily reach the TLDR or TMI thresholds, thus killing their utility in a clinical reporting context. Third: know of any infectious disease Dx bioinformatics pipelines that are 21 CFR Part 11 compliant? HL7? HIPAA? OK.. how about pipelines that keep full audit trails of all data manipulations? Unless these issues are addressed, metagenomics Dx for infectious diseases will never get out of the research lab, let alone see any sort of FDA approval.

Hopefully some smart software developers are working on solutions for this problem - and not approaching it as strictly a #metagenomics research issue. We need benchmarking - yes - and purpose built tools for the problem at hand (not a re-tooling of existing research software). ;)

Charles Chiu said...

Please see our reply on F1000 addressing the comments on the previous post by Dr. Salzberg.

Charles Chiu said...

Hi Jonathan - we are developing a clinical version of SURPI under a HIPAA-compliant framework in collaboration with a company and also actively validating the pipeline in a CLIA-certified laboratory. It would be great to get in touch with you to discuss ways to work together to move this forward. -Charles