Tuesday, August 29, 2017

The Curse of Spammotation Lives!

High-throughput sequencing of genomes is over twenty years old, and it demanded the development of automated pipelines for annotating the resulting data.  I've worked on such pipelines since the early 1990s, implementing them as a student and at two different corporate stops.  Recently we were reviewing results from my pipeline against some of the other ones out there to see what could be done better.  Unfortunately, I found infuriating problems with RefSeq entries annotated with NCBI's bacterial genome annotation pipeline.  Now, I'm usually one to sing the praises of NCBI -- they are a key resource for biological research and they make multiple spectacular public services freely available to the entire world.  But I'm afraid this time I need to vent.

As noted above, I was comparing the annotations in a RefSeq entry with the ones my pipeline assigns.  This isn't easy to do in a truly systematic way, so I ended up writing some quick code to produce output like the following, which lists the CDS coordinates, the source of the annotation (NCBI or my pipeline, KR), and the annotation itself:

47857   49887   NCBI    peptidylprolyl isomerase
47857   49887   KR      SpoIIE:472-654 Stage II sporulation protein E (SpoIIE)
47857   49887   KR      PAS_4:19-123 PAS fold
47857   49887   KR      PAS_3:172-236 PAS fold
47857   49887   KR      GAF_2:274-416 GAF domain
47857   49887   KR      40.0%id SCO0451 Uncharacterized protein

So this CDS from NC_018524.1, corresponding to WP_014912717.1, seems a bit divergent between the NCBI and KR annotations.  This one jumped out at me because of the NCBI annotation: peptidylprolyl isomerase, aka rotamase.  I happen to know more than I ever thought I'd know about this class of enzyme, as Starbase's sensors were long focused on natural products which are known to bind to these enzymes.  Rotamases are quite cool: they flip the peptide bond preceding a proline between its two possible stable states, cis and trans.  There are three classes of rotamases and each has a distinctive domain -- and none of those domains is to be found in this protein.
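For the curious, the quick comparison code behind a listing like the one above can be very small.  This is only a sketch, not my actual script: it assumes each annotation has already been parsed into a (start, end, source, label) tuple, and the function name side_by_side is made up for illustration.

```python
from collections import defaultdict


def side_by_side(annotations):
    """Group (start, end, source, label) tuples by CDS coordinates.

    Returns tab-separated lines sorted by coordinates; rows that share a
    CDS keep the order in which they were supplied.
    """
    by_cds = defaultdict(list)
    for start, end, source, label in annotations:
        by_cds[(start, end)].append((source, label))

    lines = []
    for start, end in sorted(by_cds):
        for source, label in by_cds[(start, end)]:
            lines.append(f"{start}\t{end}\t{source}\t{label}")
    return lines


# Rows taken from the listing above (two of the KR rows, for brevity).
rows = [
    (47857, 49887, "NCBI", "peptidylprolyl isomerase"),
    (47857, 49887, "KR", "SpoIIE:472-654 Stage II sporulation protein E (SpoIIE)"),
    (47857, 49887, "KR", "GAF_2:274-416 GAF domain"),
]

for line in side_by_side(rows):
    print(line)
```

Keying on the coordinates is what makes disagreements easy to spot: every label for the same CDS lands on adjacent lines.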

This problem, which I hereby call "spammotation" (a portmanteau of spam and annotation), has long bedeviled annotation pipelines.  We generate labels for proteins based on similarity, but must thread the needle between being overly cautious and mislabeling.  In particular, because proteins can have multiple domains, it is quite easy to incorrectly transfer an annotation due to false transitivity.  In other words, if protein A is similar to protein B and protein B is similar to protein C, it does not always follow that protein A is similar to protein C.  Many times it will be, but sometimes the bits of B which are similar to A are completely separate from the bits of B which are similar to C.  An early example of this was one of the first yeast chromosome sequences, whose annotation declared that it encoded a chlorophyll-binding protein.  While that isn't impossible, it is certainly highly unlikely given the non-photosynthetic nature of Saccharomyces.
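One simple guard against false transitivity is to ask whether the part of B that matches A actually overlaps the part of B that earned B its label.  The sketch below assumes you already have both regions as 1-based, inclusive residue ranges on B; the function names, the 0.8 threshold, and the example coordinates are all illustrative assumptions, not anything a particular pipeline is known to use.

```python
def overlap_fraction(region_a, region_b):
    """Overlap between two 1-based inclusive ranges, as a fraction of the shorter one."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    overlap = max(0, end - start + 1)
    shorter = min(region_a[1] - region_a[0] + 1, region_b[1] - region_b[0] + 1)
    return overlap / shorter


def transfer_is_supported(match_on_b, labeled_region_on_b, min_fraction=0.8):
    """True if the region of B matching A covers the region that justifies B's label."""
    return overlap_fraction(match_on_b, labeled_region_on_b) >= min_fraction


# Hypothetical coordinates: A matches residues 19-416 of B, but the domain
# behind B's label sits at residues 500-650, so the label should not transfer.
print(transfer_is_supported((19, 416), (500, 650)))   # False

# Here the match and the labeled domain nearly coincide, so it can transfer.
print(transfer_is_supported((472, 654), (480, 650)))  # True
```

Measuring against the shorter of the two ranges keeps a small domain from being penalized just because the overall alignment is long.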

An important quality of any such annotation-by-similarity pipeline is traceability: the evidence for an annotation should be able to be tracked from source to destination.  NCBI provides such tracking in the form of notes on the CDS which describe the source of the annotation.  If we look at WP_014912717.1 we see that this annotation was generated by similarity to -- oh no!  WP_014912717.1!!  Indeed, 2429 out of 5029 CDS in NC_018524 refer to themselves as evidence for their annotation! That's 48.30%!!!  AAAIIIIEEEE!!
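Counting these self-citations is easy to automate.  Here is a minimal sketch which assumes the CDS features have already been pulled out of the GenBank record as (protein_id, evidence strings) pairs, and that the evidence text names its source with a RefSeq-style accession such as WP_014912717.1.  The exact wording of the evidence string in the example is an assumption about the format, and the second record is a made-up placeholder.

```python
import re

# RefSeq-style accession: two capital letters, underscore, digits, dot, version.
ACCESSION = re.compile(r"\b([A-Z]{2}_\d+\.\d+)\b")


def self_referential(cds_records):
    """Return the protein IDs whose own evidence strings cite that same protein."""
    hits = []
    for protein_id, evidence in cds_records:
        cited = {acc for text in evidence for acc in ACCESSION.findall(text)}
        if protein_id in cited:
            hits.append(protein_id)
    return hits


records = [
    # Evidence wording is an assumed format, not copied from the entry.
    ("WP_014912717.1", ["similar to AA sequence:RefSeq:WP_014912717.1"]),
    # Placeholder accessions, purely for illustration.
    ("WP_000000001.1", ["similar to AA sequence:RefSeq:WP_000000002.1"]),
]

print(self_referential(records))     # ['WP_014912717.1']

# The tally quoted above, as a percentage.
print(f"{100 * 2429 / 5029:.2f}%")   # 48.30%
```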

Of course, I'd also be griping if there were simply circles of mutual annotation transferal -- A is annotated based on the similarity to B and B on similarity to A.  And in fairness, I haven't found a lot of cases of this in the nucleotide entry -- though there are a whole bunch of homologs of the protein which are all annotated incorrectly as rotamases.

Correct annotation by similarity requires several components.  False transitivity should be avoided. Reasonable cutoffs should be used to assign function.  And critically, a well-curated database of proteins with trusted annotations is needed.  UniProt is a good source for the last of these, if you filter for the highest evidence levels, though unfortunately UniProt doesn't keep up with every functional assignment of a protein.  So such a database subset will have a false negative problem.
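Put together, those components amount to a short decision rule.  The sketch below is one way to write it down, under loud assumptions: the hit is a plain dictionary with hypothetical field names, the trusted set is a dictionary from protein ID to curated label, and the 50% identity and 0.8 coverage cutoffs are placeholders to be tuned, not recommended values.

```python
FALLBACK = "hypothetical protein"


def transfer_annotation(hit, trusted, min_identity=50.0, min_coverage=0.8):
    """Return the subject's curated label only if the hit passes every check."""
    # Only transfer from proteins in the curated, trusted set.
    if hit["subject"] not in trusted:
        return FALLBACK
    # Require a minimum percent identity (placeholder cutoff).
    if hit["percent_identity"] < min_identity:
        return FALLBACK
    # Require the alignment to span most of BOTH proteins, which guards
    # against a single shared domain driving the label.
    if min(hit["query_coverage"], hit["subject_coverage"]) < min_coverage:
        return FALLBACK
    return trusted[hit["subject"]]


# Hypothetical trusted set and hits, for illustration only.
trusted = {"trusted_1": "example curated function"}

weak = {"subject": "trusted_1", "percent_identity": 40.0,
        "query_coverage": 0.95, "subject_coverage": 0.90}
strong = {"subject": "trusted_1", "percent_identity": 75.0,
          "query_coverage": 0.95, "subject_coverage": 0.90}

print(transfer_annotation(weak, trusted))    # hypothetical protein
print(transfer_annotation(strong, trusted))  # example curated function
```

Falling back to "hypothetical protein" is the overly cautious side of the needle, which is exactly where the false negative problem comes from.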

I don't want to suggest that NCBI's pipeline is uniquely problematic; there are bad annotations all over the database.  There is a protein family with which I am intimately familiar which is often misannotated in GenBank entries.  The family has two known enzymatic activities, which we'll just call A and B, and these neatly fall out of a tree of the family.  But that hasn't stopped some entries from labeling A-type proteins with the B activity. Sadly, many RefSeq entries appear to be labeled with a poor description of the A activity (and many of those are probably Bs) -- and many others are just labeled as "hypothetical protein" despite having clear similarity to the family. But then again, an entry submitted from JGI labels a member of this family with yet another activity we'll call C -- which is even more distant on the tree than A or B and a quite different reaction.  So presumably the JGI trusted protein set for annotation generation is missing any proteins with activity A or B.

Automated annotation will never be perfect, but we can always try harder.  Spammotation is perhaps the most frustrating because it means a small error balloons into a larger error which can then further grow.  Alas, the only solution is that frequently propounded by Alastor Moody: Constant Vigilance!

1 comment:

Guy said...

Preaching to the choir, Keith! The biggest issue I have with the new RefSeq proteins, the ones with WP_# accessions, is that they represent a collection of identical protein sequences (not necessarily identical coding sequences) for which some annotation is assigned -- and you often cannot tell where that annotation came from. I remember a case where there were four identical sequences in nr, collapsed down to a single RefSeq protein entry. One was annotated as a hypothetical protein, but the other three were erroneously annotated as a specific enzyme even though they bore none of the characterized hallmarks of said enzyme. And by some process (majority rule?) the WP_ entry was annotated as that enzyme. When I got a near identity match to it I thought "great -- someone figured out what that hypothetical protein actually does!" ... Nope!