Last week's piece concerned an F1000 paper by Steven Salzberg and colleagues which reanalyzed a clinical metagenomics dataset and alleged two serious problems. The first claim was that the software used in the paper failed to detect potential bacterial pathogens in the samples. The second claim was that human sequences remained in the data deposited in the SRA for unrestricted access.
In response to my piece, Dr. Charles Chiu, the senior author on the targeted paper, wrote to me with a number of critical points. First, as a simple (but embarrassing) point of fact, I had misidentified Joe DeRisi as an author on the paper; DeRisi was an author on the paper (Greninger et al.) describing the SURPI analysis pipeline, but not on the paper criticized by Salzberg. I had tried to check the author lists, but succeeded in confusing the two, which was dumb and sloppy.
A more serious issue, which I have realized through my discussion with Dr. Chiu, is that I wrote the original piece largely accepting the claims of the F1000 piece, which was a serious error given that I had not read the Greninger paper. That paper, published in the Lancet, is unfortunately inaccessible to me -- another of my early goals was for blogging to carry no net cost, and I do not have access to the Lancet, which has a strict paywall model in which no content is free, including the supplementary materials. The appropriateness of paywalled science is not a topic I wish to tackle here at this time; my point is only that if I can't actually evaluate the claims made in a paper criticizing another paper, then I should report them in a neutral tone, avoiding any appearance of taking sides. I failed in that, and I apologize to Dr. Chiu and his group for this serious misstep on my part.
Given how I wrote the piece, the title ("Leaky clinical metagenomics pipelines are a very serious issue") is a bit problematic; it already skews the discussion towards assuming the pipeline in question leaked. The remainder of my discussion of the need for standards and good benchmarks stands, but it should have had a more neutral introduction. Titles are frequently a very late addition to a piece (it's amazing I don't routinely forget to add one), and this one was a last-minute inspiration. If I had been more careful, what might the title have been? Perhaps "How can we ensure leak-free metagenomic pipelines?"
Salzberg's second charge, that the released dataset still contained human sequences, is not in dispute. But it raises another important issue: what is the ethical obligation if one finds such a leak of potentially personally identifiable data? Notifying the depositor privately could be seen as the right thing to do, as it gives an opportunity for the problem data to be withdrawn before attention is drawn to it, but not reporting it publicly also prevents a public discussion of the appropriate level of concern. Put another way, if one finds one such dataset, is one ethically obligated to check every other dataset in the SRA and contact all of those authors? And what is the obligation if an author fails to respond? As methods for aligning human sequences get better, should these datasets undergo regular re-cleaning? Or has the horse left the barn at that point?
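To make the re-cleaning question a bit more concrete, here is a minimal sketch of what re-screening a deposited read set against a current human reference might look like: align everything to the human genome and separate mapped from unmapped reads. To be clear, this is purely illustrative and is not SURPI or anything used by either group; the tool choices (minimap2 and samtools), the short-read preset, and the file names are all my own assumptions.

import subprocess

def rescreen(reads_fastq, human_ref_fasta, out_prefix):
    """Align all reads to a human reference, then split mapped (candidate
    human) reads from unmapped (candidate non-human) reads."""
    bam = out_prefix + ".vs_human.bam"

    # Align once with minimap2's short-read preset and store the result as a BAM.
    align = subprocess.Popen(
        ["minimap2", "-ax", "sr", human_ref_fasta, reads_fastq],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["samtools", "view", "-b", "-o", bam, "-"],
        stdin=align.stdout, check=True,
    )
    align.stdout.close()
    align.wait()

    # Split on the SAM 'unmapped' flag (0x4):
    #   -f 4 keeps unmapped reads (candidate non-human, could be released)
    #   -F 4 keeps mapped reads (candidate human, to be withheld or reviewed)
    for flag_opt, label in (("-f", "nonhuman"), ("-F", "human")):
        with open(f"{out_prefix}.{label}.fastq", "w") as out:
            subprocess.run(
                ["samtools", "fastq", flag_opt, "4", bam],
                stdout=out, check=True,
            )

In a real pipeline the "human" file would be withheld or reviewed rather than released, and read pairing, quality trimming and more sensitive aligners would all matter; this only shows the shape of the step.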
I don't work in the clinical metagenomics space and so have never tried to deposit such a dataset. Do the SRA and ENA have mechanisms for third parties to flag datasets as containing potentially re-identifying data? Should these databases have their own pipelines for checking deposited metagenomic datasets for human reads, or for cleaning them? Should they offer two-tiered access, with one tier providing the entire dataset but requiring special access privileges and the other providing only the cleaned data? After all, once a dataset has been cleaned of human reads it can no longer be used to test the ability of software to remove those reads. (Note: I understand that from a clinical perspective, removal of human reads is primarily an optimization and not viewed as critical for success.) The worst outcome of this episode would be for authors to become less inclined to deposit clinical metagenomic datasets, as that would deprive the community of valuable real data for tool testing and further mining.
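Along the same lines, the sort of automated check I imagine a repository or a third party might run could be as simple as counting how many primary reads in such an alignment map to the human reference and flagging the submission above some threshold. Again, this is a hypothetical sketch: the use of pysam, the threshold, and the idea that the SRA or ENA would run anything like it are all my assumptions.

import pysam

def human_read_fraction(bam_path, threshold=1e-3):
    """Return the fraction of primary reads that map to the human reference,
    plus a boolean saying whether it exceeds the (arbitrary) threshold."""
    total = mapped = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # count each read only once
            total += 1
            if not read.is_unmapped:
                mapped += 1
    fraction = mapped / total if total else 0.0
    return fraction, fraction > threshold

# e.g. fraction, flagged = human_read_fraction("sampleX.vs_human.bam")

The hard part, of course, is not the counting but deciding what threshold matters and what should happen to a flagged dataset.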
Dr. Chiu and Dr. Salzberg have continued some public correspondence, some of it in the comments on my piece and some on the F1000 site. There is not a lot of agreement on the issue of the bacteria. Dr. Chiu states that the software did find the bacteria, but that this was reported in a very compressed form in the paper and supplementary materials, and that in any case the nature of the samples in which the bacteria were found indicated they were unlikely to be the pathogen of concern, either because they came from patients negative for the disease or because they were found in airway samples but not cerebrospinal fluid. Dr. Salzberg is standing by his group's work. One view of the gap between them is that it reflects a difference in scientific cultures, a clinical focus (Chiu) versus a research bioinformatics focus (Salzberg), and perhaps some middle ground can be reached. In any case, if you have access to the Lancet and are interested in the topic, perhaps you could weigh in via whatever medium you feel comfortable with.
The major goal of setting up this blog was to make myself think about science in its many facets, and to do so by inviting others to comment on and criticize my writing. I'll try harder in the future not to make careless mistakes such as the ones here, as the fun part is being challenged on my thinking, not on a gap between my ideals and my writing.