Tuesday, August 23, 2022

SRA Entries Should Not Ever Disappear Into Thin Air

I ran into an annoying problem last night and was quite steamed, but had the discipline to wait until morning to vent publicly about it.  Now I'm in a more morose mood on the subject, not furious but still quite frustrated. The quick version of what happened: I'm belatedly trying to go through some nicely documented, reproducible analysis code to explore some concerns I have with the analysis. The code works on an SRA entry, and that SRA entry is the entire point of the analysis. That entry, which I know once existed, now doesn't. Other than this code and the preprint that goes with it, it's as though it never existed -- which is terrible.  And I'm irritated with everyone who contributed to that terrible result, starting with NCBI.

Okay, the long version.  Earlier this year Ultima Genomics came out of stealth, and I was one of the few outlets they gave an advance interview to (30 kilobases of infectious RNA kept me from actually visiting).  After that and other hype, Sina Booeshaghi and Lior Pachter of Caltech took a critical whack at the data and rapidly produced a BioRxiv preprint that was very critical of Ultima's data quality.  They based this on a preprint from the Broad Institute and Ultima and later added some data from another preprint.  There are aspects of the way Booeshaghi & Pachter preprocessed the Ultima data that I have reservations about, so I wanted to walk through their pipeline and then see for myself how much changing the aspect of concern (truncating the Ultima data before alignment or pseudoalignment) would change things.  And I ran into a nasty bump: the SRA entry SRR18145555 for the Ultima data referenced in the code just no longer exists.

So starting my brickbat throwing with NCBI, this is the first terrible part.  When I say SRR18145555 no longer exists, I mean it.  If you search SRA for it (and ENA either never got a copy or has also removed it), you get nothing.  No message "this record removed by request of submitter" or such -- just nothing.  It's foolish enough that NCBI allows submitters to withdraw data months later (this is no "oopsie I hit the wrong button" case) and after that data has been the subject of significant public scientific discussion, but to do so with no trace is unconscionable!
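
For anyone who would rather script this check than trust the web search box, here is a minimal sketch built around NCBI's E-utilities esearch endpoint. The helper names are my own invention, and I'm assuming that a hit count of zero is how a vanished run shows up. The network call is left as a comment so the parsing can be tried offline.

```python
import json
from urllib.parse import urlencode

# NCBI E-utilities search endpoint
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(accession, db="sra"):
    """Build an esearch URL for one accession (hypothetical helper)."""
    params = {"db": db, "term": accession, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

def hit_count(response_text):
    """Return the record count reported in an esearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])

def accession_exists(response_text):
    """True if the search found at least one record (assumed meaning of 'exists')."""
    return hit_count(response_text) > 0

# To run against the live service (not done here):
#   from urllib.request import urlopen
#   text = urlopen(esearch_url("SRR18145555")).read().decode()
#   print(accession_exists(text))
```

Note that a zero count cannot tell you whether a record was withdrawn or never existed, which is exactly my complaint.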

This points to a related problem and a potentially very serious one.  The Broad/Ultima preprint says the data can be found in Gene Expression Omnibus at GSE197452.  That entry is interesting and problematic because it lists four platforms: three Illumina boxes and 454 GS.  Which is clearly nonsense, since this was all submitted in 2022, years after 454 sequencers were discontinued.

The GSE in turn lists each of the three datasets -- and they are correctly labeled Illumina and Ultima -- and points to GSM6297379 as the Ultima dataset (again in GEO).  That entry also lists 454 GS as the instrument.  Maybe this was a case of "I gotta pick something so will pick a thematically close box" but it annoys me to have metadata corrupted like this -- people like me like to search that data.  So for starters, could NCBI please add Ultima to their platform list?

Okay, GSM6297379 in turn points to SRA experiment SRX16043372, which at the moment has two SRA run entries in it. Those have higher integers embedded (SRR20002549 and SRR20002550), so these must be later uploads from the Broad (who owns all of these entries).

But that's really a serious problem, as the Broad preprint makes claims about the data and critically the Broad preprint on BioRxiv was posted May 29th.  This is a very computer-programming-like issue -- we have a chain of pointers and the last one now points to nothingness; the Broad's actions have caused the Booeshaghi & Pachter preprint code to effectively segfault.  But it also means the top-level pointer isn't remotely specific enough -- it truly is terrible that the Simmons et al. preprint from the Broad is saying "you can find the data we analyzed under this accession" when ultimately the data that accession points to can shift with the tides.  If I wish to reproduce the Simmons results, which dataset is correct?  And with the new data, is the May 25th version of the preprint still valid?
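
To make the pointer analogy concrete, here is a small sketch that walks the accession chain described above and reports which runs an analysis pinned but can no longer reach. The `links` table is typed in by hand from what GEO and SRA showed me and covers only the Ultima branch; the function names are hypothetical, and nothing here queries NCBI.

```python
def reachable_runs(links, root):
    """Follow accession links from root and return the leaf accessions, sorted."""
    stack, leaves, seen = [root], [], set()
    while stack:
        acc = stack.pop()
        if acc in seen:
            continue
        seen.add(acc)
        children = links.get(acc, [])
        if children:
            stack.extend(children)   # keep descending the chain
        else:
            leaves.append(acc)       # nothing below this accession: treat as a run
    return sorted(leaves)

def missing_runs(pinned, links, root):
    """Runs an analysis refers to that the chain from root no longer reaches."""
    return sorted(set(pinned) - set(reachable_runs(links, root)))

# Hand-entered snapshot of the chain (Ultima branch only; an assumption, not a query)
links = {
    "GSE197452": ["GSM6297379"],
    "GSM6297379": ["SRX16043372"],
    "SRX16043372": ["SRR20002549", "SRR20002550"],
}

print(reachable_runs(links, "GSE197452"))
print(missing_runs(["SRR18145555"], links, "GSE197452"))
```

Had the preprint recorded the run-level accessions alongside the GSE, a check like this would at least flag the moment the bottom of the chain changed.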

Most likely the new data represents new chemistry and/or basecalling software from Ultima and might substantially change Booeshaghi & Pachter's conclusions -- but it's unreasonable to expect the Caltech group to constantly turn a data treadmill.  It certainly would be interesting to iterate their analysis (and others) on the newer data, but since I can't rerun the analysis on the old data, a comparison isn't possible.

Ideally we'd have versioning of both the chemistry and the basecaller in the Broad preprint, but there doesn't appear to be any.  Even in the Oxford Nanopore community, and ONT has been very open about this, preprints and publications are all too often missing the critical information as to which pore type, which basecaller, which version of the basecaller, which basecalling model, etc. were used -- yes, it is a lot of minutiae to collect but it's actually very important minutiae for supporting reproducible and comparable science.
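
As a rough illustration of what I'd like to see recorded with every dataset, here is a sketch of a provenance record covering the items listed above. The class and field names are my own assumptions rather than any existing standard, and the helper simply lists whatever was left unreported.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class RunProvenance:
    """Hypothetical per-run record of how the reads were generated."""
    platform: str
    chemistry: Optional[str] = None
    pore_type: Optional[str] = None           # only meaningful for nanopore runs
    basecaller: Optional[str] = None
    basecaller_version: Optional[str] = None
    basecalling_model: Optional[str] = None

def unreported(record):
    """Return the names of fields that were left empty."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]

# A record that names only the platform leaves everything else unreported
print(unreported(RunProvenance(platform="Ultima")))
```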

The scientific world will never be perfect, but this seems like a series of really obvious and avoidable missteps. Can't we do better?


seanken said...

Sorry for the confusion on this! The 'updated' data should be the same as the old data. Basically it relates to the 454 issue you pointed out--as you guessed we uploaded the Ultima data as 454 (after consulting with GEO) since Ultima wasn't an option yet (hopefully will be added soon!). Unfortunately there were issues with the original upload we didn't catch until after the preprint was released (namely the Ultima data was mislabelled as Illumina by the system, likely due to the way our sample sheet was set up). As such we worked with GEO to get this fixed but I guess in that process it changed the SRA IDs. Not ideal I agree, but all the same data should still be there (as well as some extra data we uploaded for our paper revisions!).
