Monday, August 08, 2011

Names in Collision

I will claim that I saw this coming, in that I've toyed with the basic skeleton of this post before.  I hadn't gotten around to writing it -- but how could I miss this opportunity?  On a mailing list devoted to SAM, someone asked about a topic related to SAM, and one of the experts on the board replied with an electronic head-scratching.
As you might guess, the problem is that SAM has been overloaded with meanings.  The mailing list concerned Significance Analysis of Microarrays, a tool for analyzing microarray data.  However, the questioner was working with Sequence Alignment/Map format, the standard scheme for storing high-throughput sequencing data aligned to a genome.  But it could also have been someone asking about Sequence Alignment and Modeling, a package for using Hidden Markov Models to identify distant protein homologs and model protein structure.  Perhaps with that SAM you might try to model the Sterile Alpha Motif, or perhaps some enzymes involved in the metabolism of S-Adenosyl Methionine.  Those are all the ones I could pull out of my brain in the general biology space, but of course Wikipedia has a few more that might show up in the field, including Self-Assembled Monolayers and shoot apical meristem.

The same group that gave us Significance Analysis of Microarrays also generated Prediction Analysis of Microarrays, or PAM.  But Point Accepted Mutation predates that by decades, having been devised & named by bioinformatics pioneer Margaret Dayhoff, whose work predates my existence by a significant margin.  On my Sun SPARC workstation at Harvard, pam was both a member of the original BLAST suite, used to generate arbitrary PAM matrices, and a utility to fix stuck windows on the workstation.  More than once I got one when I needed the other.  Now, of course, if I say SPARC a colleague is much more likely to think of an extracellular matrix protein than a desktop workstation.

Name collisions abound in bioinformatics and biology.  Zooming in on compact, pronounceable acronyms is a major issue.  Scientists naming something in one field appear to rarely check in other fields -- or sometimes in their own field.  In one of the many travails of dealing with cancer cell lines (which could and should be, and is planned to be, a whole series of posts on "The Scandals of Human Cancer Cell Lines"), I've discovered that COSMIC contains three clearly different cell lines named "PC-3".  That's not a knock on COSMIC; they've taken the sanity-preserving decision of not trying to adjudicate the insanity in the literature.  But checking references confirmed the supposition that cell lines from prostate, lung and pancreas cannot possibly be the same thing.  You'd think some names would be clearly off-limits, but I swear I saw some other package in the broad biology space (which, of course, I now cannot find a reference to) which used the acronym BLAST, but had nothing to do with sequence alignment.

It can get serious; I've heard a Nobel laureate biochemist confuse the Anaphase Promoting Complex with the product of the Adenomatous Polyposis Coli gene; both are APC and both are relevant to cancer, but in different ways.  Immunology throws in Antigen-Presenting Cell.  Another favorite of mine for checking databases is Fatty Acid Synthase versus the TNF-receptor homolog FAS.  Of course, my thesis advisor was in the Faculty of Arts and Sciences at Harvard, so perhaps that is a natural one for me to fixate on.

Even focused organizations can fall prey to this problem.  Millennium periodically had internal collisions.  For example, we had two tools named SPOT, one for sequence analysis and one for controlling array-spotting robots. When Millennium transferred its platform, this was an issue for the documentation and training folks to wrestle with (I think we eventually renamed one).

There's also the problem created by some of the naming conventions in biology.  This winter, my work apparently stymied my attempt to observe a tenrec at a zoo -- after all, my company is developing a hedgehog inhibitor!  

Then there is the cross-talk inside a company between non-biological disciplines and biological ones, which can sometimes be swapped quickly in a presentation.  Are we talking Regions of Interest or Return on Investment?  If you attempt to patent methods for removing antigen-antibody complexes from solution, is that now IP on IP (which, of course, you will send to your lawyers via a cable using IP)?

Is there a solution?  I doubt there is one beyond authors actually searching carefully before naming their tool or newly found protein.  The whimsies of Drosophila biologists will continue to entertain and cause trouble, though other nightmares have come from other disciplines.  For example, what lunatic decided to name two distinct lipoprotein precursors "apolipoprotein A" and "apolipoprotein (a)"?  In the meantime, be on your toes to avoid confusing both yourself and others.  In particular, insist on the correct acronyms if they must be used at all!  An irritant to me at work is dropping the extra A from the acronym for our target "fatty acid amide hydrolase", as to me FAH is a very different human enzyme (fumarylacetoacetate hydrolase).  It's a crazy world out there -- don't make it crazier!


Rick said...

>Is there a solution? I doubt there is one beyond authors actually searching carefully before naming their tool ...

Ah, if only people could think carefully as they create their tools. And, not to be too age-ist, but especially the younger people who haven't run into the confusion via using other people's work (in programming I find the same to be true with documentation and usability -- it is amazing how much better a person becomes once they feel the frustration of working with the poorly written programs of other people.)

For example, how many utility programs are called 'convert'? That name is obvious to the author -- heck, the program 'converts' from one format to another so why not call it the obvious name? But once the program suite is out in the world then the utility's name conflicts with other suites. Better to have made a more specific name. But that requires extra thought!

Not much original in the above thoughts but it at least gave my fingers a workout during my early morning coffee. :-)

Noah Fahlgren said...

Shoot Apical Meristem


Noah Fahlgren said...

Oops, you had that one.

Kay Aull said...

Ah, namespace. Sadly, pronounceable 3-5 letter acronyms are a finite resource...though perhaps a renewable resource. I think you can get away with naming something SPARC by now.

Yes, cell lines are hopeless. (AA-N is a typical format, and there are more than 26*26*10 of them kicking about.) My current favorite, though, is the two distinct definitions of RT-PCR (techniques which can be done together, and often are, though there's been a semi-successful attempt to rename one of them).

The Drosophila folks have at least tried to broaden the namespace, but the problem is, the names tend to not make sense to anyone but the sleep-deprived grad student who was writing up the first phenotype connected to it. (Well, you've got cancer, so we'll give you these anti-hedgehog pills...) And at least with acronyms, if you've totally missed the mark, you can discreetly reassign the letters after the fact.

Though this does give me some mischievous ideas. Maybe there's a thesis proposal in my future: "Effects of the Anaphase Promoting Complex on the Adenomatous Polyposis Coli Locus in Differentiation of Antigen Presenting Cells".

suicyte said...

You forgot the Activated Protein C :-)
I also had a post on this subject a while ago:
APC is dangerous for your kid