Thursday, December 07, 2017

On the Problem of Sequence Leakage

I've been spending some time lately in an unfamiliar world: the eukaryotic section of NCBI's NR protein database.  I've been almost exclusively a bacterial guy for six years, but the other side of starbase had an interest in finding homologs of a particular protein, so I went diving for some.  That experience has reminded me of two serious issues with public sequence databases.  Tonight I'll dash off a bit about one; expect the other complaint to show up in the not-so-distant future.  Tonight's lament is the increasing dispersion of sequence repositories.

The first bioinformatics program I ever wrote was to key in DNA sequences from the literature.  Back in the 1980s, journals still published DNA sequences and Genbank was grossly behind in collecting them all. There weren't many sequences available for our organism, Chlamydomonas reinhardtii, so we wanted every last one to understand codon usage and such.  A professor let me use her desktop software to key them in and it had a simple verification scheme: just retype the sequence a second time and it would chirp at any discrepancy.  But that meant coordinating schedules and it seemed straightforward, so a bit of hacking in Turbo Pascal and I had a working facsimile.  The only issue left was that many of the sequences were printed in uppercase with 8 pin dot matrix printers, which meant that C and G differed by one pixel - in a genome with about 55% G+C.  A poor photocopy by me or just a lousy ribbon by the original author and I was left squinting for that pesky pixel.
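For the curious, the double-entry check described above is easy to sketch in modern Python.  This is not a reconstruction of the original Turbo Pascal program or of the professor's desktop software; the function name `find_discrepancies` and the choice to ignore whitespace and case are my own assumptions for illustration.

```python
def find_discrepancies(first_entry, second_entry):
    """Compare two independently typed copies of a sequence.

    Returns a list of (position, first_char, second_char) tuples, with
    1-based positions. Whitespace and letter case are ignored (an
    assumption made for this sketch, not a detail from the original tool).
    """
    a = "".join(first_entry.split()).upper()
    b = "".join(second_entry.split()).upper()

    # Position-by-position comparison over the shared length.
    mismatches = [
        (i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y
    ]

    # If one copy is longer, flag the first position where they diverge.
    if len(a) != len(b):
        shorter = min(len(a), len(b))
        mismatches.append(
            (shorter + 1, a[shorter:shorter + 1] or "-", b[shorter:shorter + 1] or "-")
        )
    return mismatches


if __name__ == "__main__":
    # A single C/G slip, the kind a faint dot matrix printout invites.
    for pos, x, y in find_discrepancies("ATGCGC", "ATGGGC"):
        print(f"position {pos}: first copy has {x}, second copy has {y}")
```

An empty list means the two copies agree; anything else is where the chirp would go.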

When I was in grad school, control of Genbank shifted to NCBI and somewhere around then most journals vowed to require authors to submit sequences prior to publication.  It was a great idea and made a huge impact.  A few journals were laggards and even then a few sequences slipped away.  Plus there was the question of what counts as a sequence: to me even a short splice site mutation should be deposited, but that wasn't always the case.  But it made a real difference.  And from pretty much all those sequences the proteins were predicted and went into NCBI's NR database.

But then came various high throughput projects -- first ESTs then raw genome shotgun reads and ultimately short reads.  So NCBI invented new corners to stash these datasets.  And journals started getting less diligent about making sure data was deposited in advance.  And increasingly, interesting datasets didn't push data into NR.

I thought of this during my search.  I'm querying with a protein that appears to be essential across metazoans, at least multicellular ones.  I've been collecting a zoo of sequences from all over the metazoan tree and having fun looking up all sorts of Latin names I didn't know.  But then it occurred to me there were some Latin names I did know but hadn't seen.

For example, take Homarus americanus.  Actually, don't try to take it from me or go near my drawn butter -- I've become rather fond of lobster.  I once dreamed of a New England clambake genome project -- Homarus, Mercenaria mercenaria (clams), Zea mays and Solanum tuberosum -- and probably Bos taurus since all of those taste better with butter.  At this point most of those are probably in somebody's queue -- but where was the lobster at NCBI?

Well, nobody's sequenced the lobster genome yet.  But there have been published papers claiming the transcriptome.  And indeed, one of them deposited the assembled transcripts in SRA (with the amusing annotation /isolation_source="seafood retailers in ME").  But SRA doesn't feed the NR protein database, so no lobster in my hit results.  Which is a bit silly, as this isn't a mass of short reads but an actual attempt at a transcriptome assembly.  Indeed: I pulled the dataset down and found the homolog (lobsterlog?) of my query protein.

Of course, I can go pull in those sequences.  But how many other little corners do I need to search to find all the world's data?  Metagenomes here and transcriptomes there.  And where are they?  Papers describing the sequences are too often not connected to the records in NCBI or EBI, or no apparent record exists.  I'm a googleholic, but it gets to be a pain having to do that just to track down databases to search.

It's particularly unfortunate because there are increasing uses for all those sequences.  I've written previously on how very deep multiple alignments can be used to predict structures or protein-protein interaction surfaces.  Trees are informed by increasing numbers of sequences.  Deep alignments can be mined for subbranch-specific residue changes.  And so on and so on -- the biological community benefits if it is straightforward to quickly survey all protein sequences.

I don't have a comprehensive memory of what genomes have been sequenced -- that's quite impossible these days.  But I did realize another omission -- only one tardigrade sequence had shown up and multiple tardigrades have been (somewhat notoriously) sequenced.  Similarly, I found a hagfish sequence but not a lamprey.  Some of these omissions probably represent defects in available assemblies.  Perhaps some are really cases where this protein isn't actually essential -- but how would I ever be confident I've checked every nook of the Internet to tease it out?

Solving the problem won't be easy.  But the first step is to recognize the problem and the second step is for the community to decide it is a problem worth chipping away at.


Scott Edmunds said...

There's also the new challenge of the Chinese Academy of Sciences telling Chinese researchers to deposit their data in their GSA (Genome Sequence Archive) database.  Journals can get confused by this, as their accessions look exactly like INSDC ones but are not mirrored or indexed by EBI/NCBI/DDBJ.

Keith Robison said...

This post garnered quite a few comments on Twitter, which I have Storified.