I am a firm believer that the practice of science is the result of contingency; we do not necessarily have the best scientific culture possible, but rather one which has evolved over time, driven by chance, necessity and human nature. We should never hesitate to re-examine the way science is actually practiced, and that particularly holds true for how we analyze data and publish results. A re-analysis of a prominent Lancet paper has just come out in F1000, and this work by Steven Salzberg and colleagues illustrates a number of significant issues that slipped past conventional peer-review publishing practice.
The 2015 paper in question (Greninger et al.) was from a large group including Joe DeRisi, and was related to (but distinct from) the juvenile encephalitis case study that yielded one of my most-read entries in this space. The informatics methods used both in the case study and in the paper in question were published as a separate paper in Genome Research, and it is this tool (SURPI) which has been called into serious question, with potentially very serious consequences.
The Greninger paper used shotgun metagenomic sequencing to strengthen a suggested link between enterovirus D68 and a polio-like syndrome called acute flaccid myelitis. While Salzberg and colleagues agree (as noted in a correspondence reported by homolog.us) that many of the samples do contain enterovirus D68, their first criticism is that several samples present an alternative hypothesis of serious bacterial infections. This does not significantly affect the implication of enterovirus D68 in the disorder, but since these same tools are being touted as capable of informing treatment decisions, the discrepancy is of significant concern. Failing to detect bacterial infections that might be treatable with antibiotics would be a serious miss in a clinical setting, as would missing that one of these bugs is multi-resistant Staphylococcus aureus. Now, Nick Loman has suggested on Twitter that perhaps this diagnosis is a bit strong based on metagenomics alone, but even raising the possibility would seem (to this non-clinician) to be important.
The other issue uncovered by Salzberg's group is also troubling. Metagenomic data generated from patient samples are invariably contaminated with large numbers of human reads. For fast decision making these are simply a burden which the pipeline must deal with. But when depositing data in unrestricted public databases, failure to fully screen out human sequences before release represents a serious breach of ethics and patient trust.
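To make the screening step concrete, here is a minimal sketch of the common depletion approach: align everything against a human reference, then release only the reads that failed to map. All file names are hypothetical, and a real pipeline would also need to handle read pairs, secondary/supplementary records and quality trimming.

```python
import pysam

# Minimal human-depletion sketch: write out only reads that did NOT align
# to the human reference (e.g. GRCh38) when building the file destined for
# public deposit. File names are illustrative placeholders.
with pysam.AlignmentFile("sample_vs_grch38.bam", "rb") as bam, \
        open("reads_for_deposit.fastq", "w") as out:
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped and read.query_sequence and read.query_qualities:
            out.write("@{}\n{}\n+\n{}\n".format(
                read.query_name,
                read.query_sequence,
                pysam.qualities_to_qualitystring(read.query_qualities)))
```

Of course, depletion of this sort is only as good as the reference and the aligner's sensitivity; any human read that fails to map leaks straight through, a point I'll return to below.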
It is easy to say that the standard scientific process failed on multiple levels. Patients and their privacy are supposed to be protected by Institutional Review Boards (IRBs), but these serve primarily to review proposed research protocols; it is exceedingly doubtful that any IRB has ever run a code review or tested software. Peer reviewers of the paper might try to test the software, but that is rare (I'll confess to browsing datasets, but I'm not sure I've ever tried to run software), and in this case the data breach would already have occurred by then.
The SURPI paper had drawn some flak for claiming in the title to be "cloud-enabled", which smells strongly of buzzword mongering. After all, unless you've coded your tool in assembly language for a Commodore 64 that was in your closet, it's cloud compatible. Having used Amazon EC2 extensively, I can speak from experience: cloud machines are no different from local machines, other than the fact that you can rent huge numbers of them. SURPI doesn't appear to have any particularly cloud-friendly attributes, such as easily distributing jobs across a large number of machines, which is the really great aspect of the cloud. The SURPI paper measures the performance of the pipeline in its "benchmark" section, but does not compare it to any other pipeline.
A lack of robust benchmarking is a frequent theme in bioinformatics, and a simple-to-suggest (though complex-to-implement) solution would be yet another comparison contest or effort à la CASP, CAGI, GAGE, Assemblathon, etc. Given how important it is to get these sorts of tools right, that would be an important step. The nucleotid.es effort at continuous benchmarking of assemblers is perhaps an even better model, automating much of the drudgery and enabling ongoing head-to-head comparisons. But just as critical as the testing framework is making sure we ask the right questions (as opposed, say, to a thread in which someone was asking what N50 value indicated a high-quality de novo transcriptome assembly). Sensitivity for detecting pathogens is critical, but as this paper illustrates, screening out human sequences matters for reasons that go well beyond speeding up the pipeline. Speed is important too, but must be traded off against other attributes.
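As a toy illustration of asking the right questions: a benchmark score for these pipelines could report not just pathogen sensitivity and precision against a truth set, but human-read leakage as a first-class metric. Everything below is invented purely for illustration.

```python
# Toy benchmark scoring sketch: compare pathogen calls against a truth set,
# and track human-read leakage alongside sensitivity and precision.
def score_run(truth_taxa, called_taxa, n_human_reads, n_human_leaked):
    tp = len(truth_taxa & called_taxa)
    sensitivity = tp / len(truth_taxa) if truth_taxa else 1.0
    precision = tp / len(called_taxa) if called_taxa else 1.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "human_leak_rate": n_human_leaked / max(n_human_reads, 1),
    }

# Invented example: a run that found the virus but missed the bacterium,
# and let 37 of a million human reads slip into the public deposit.
print(score_run(truth_taxa={"enterovirus D68", "Staphylococcus aureus"},
                called_taxa={"enterovirus D68"},
                n_human_reads=1_000_000, n_human_leaked=37))
```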
Further complicating the problem is the incipient rise of true real-time sequencing approaches, with analysis occurring in parallel with data collection and the potential to make decisions as soon as sufficient data has arrived to meet a defined threshold (and then shut the run down). Benchmarking such approaches may require a whole new round of data simulators which can model reads arriving over time. Complicating things further for benchmarkers is the potential to terminate sequencing of individual DNA molecules if the initial data from a molecule hits some pre-defined pattern ("read until" in Oxford Nanopore-speak).
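A back-of-the-envelope way to see what such a simulator must capture: reads trickle in one at a time, and the pipeline halts the run the moment a pre-set evidence threshold is met. Every parameter here is an invented placeholder, and the random draw stands in for a real read classifier.

```python
import random

# Streaming-decision sketch: reads "arrive" one at a time; the run is shut
# down as soon as a pre-set evidence threshold is reached. The 1% pathogen
# fraction, the threshold and the classifier stand-in are all illustrative.
def reads_until_decision(pathogen_fraction=0.01, threshold=50,
                         max_reads=100_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for n in range(1, max_reads + 1):
        if rng.random() < pathogen_fraction:  # stand-in for a real classifier
            hits += 1
        if hits >= threshold:
            return n  # reads consumed before the run could be stopped
    return max_reads  # threshold never reached

print("reads needed before shutdown:", reads_until_decision())
```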
Devising truth sets and evaluating results won't be simple either, and perhaps the hardest cases are among the most important to solve. For example, aligning to a standard human genome reference was a key part of the SURPI pipeline, but such a method would clearly leak through sequences belonging to genome segments missing from standard references -- and such sequences, by their very rarity, would be expected to be highly identifying of a patient. The new wave of graph-based aligners attempts to address the problem of private sequences, but again, the use case discussed here is an acid test for alignment false negatives.
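One mitigation sometimes suggested for the reference-gap problem is to screen reads against a k-mer set built from many human genomes rather than a single linear reference. A minimal sketch, with a toy k-mer set standing in for what would really need to be a pan-human index:

```python
# K-mer screening sketch: withhold any read sharing a k-mer with a
# (hypothetical) pan-human k-mer set built from many genomes, which could
# catch sequences absent from a single linear reference. The value of k
# and the tiny "index" below are toys for illustration only.
K = 31

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def possibly_human(read_seq, human_kmers):
    return not kmers(read_seq).isdisjoint(human_kmers)

human_kmers = kmers("ACGT" * 20)                  # toy stand-in for the index
print(possibly_human("ACGT" * 10, human_kmers))   # True: shares k-mers
print(possibly_human("A" * 40, human_kmers))      # False: no shared k-mers
```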
Hard problems, but important ones. Mistakes made early in a field's history are understandable and forgivable. What isn't forgivable is if the field continues forward without incorporating the lessons from those mistakes. Disease diagnosis and outbreak tracking from shotgun metagenomics hold great promise, but that promise can't come at the price of undermining patient trust that high standards of privacy protection are adhered to.