Omics! Omics!: A Fatally Flawed Paper

I like to review manuscripts but don't do so very often. When I started this blog I thought I might often use it to play "If I had been the reviewer", but I haven't done that much. However, a paper came to my attention that I can't stop thinking about until I tackle it here.

As an aside, I find papers I review to fall into three categories. The first are very solid papers that I can find little to comment on; I might make a suggestion or two (often about data visualization), but if the core is solid there isn't much for the reviewer to do. The second category is the most frustrating: when I feel the paper is on the edges of my expertise & I start to question whether I should have agreed to review it (which is done after seeing an abstract). The third category is the one I can really dig into: seriously flawed papers. I think one of my reviews of a paper was approaching the length of the manuscript; the paper was badly flawed but there was a thread of substance that with a lot of work could be turned into something decent.

Anyway, I noticed this paper in the BioMedCentral Table of Contents extract which is emailed to me weekly.

Comparative kinomics of human and chimpanzee reveals unique kinship and functional diversity generated by new domain combinations.

Now, back at MLNM I had for a while specialized in protein kinases, so it is a field of some interest. I hadn't kept up with the status of the chimpanzee genome sequencing, but there is a longstanding familial interest in this species so that was another angle of interest.

Sometimes when there has been some accident, a review of the circumstances leading up to it will reveal many opportunities for recognizing that a bad situation had been set up: the engineer ignored a stop signal or the dispatcher should have noticed the switch was set incorrectly. This paper, particularly one of its centerpiece findings, has that feel to it: there were many warning flags that something was amiss, but unfortunately the authors and the reviewers failed to see them.

When I first planned this critique, I was going to detail several examples. However, that would seem to lead to a very long post, so I will pick a few examples and claim that they are representative. If anyone wishes to challenge that claim, then I'll flesh out some more. Also, I feel the first example is particularly apropos because it is a bit of a centerpiece; it gets a lot of space (including a special figure) in the text.

It was this bit of text that caused me to raise my eyebrows as far as they could go (I wish I could do the Spock single-eyebrow raise, but I can't). The bolding is mine to emphasize the big surprises.

For example, a chimpanzee kinase classified as casein kinase 1 (ENSPTRP00000001150) on the basis of significant sequence similarity (31%) of the catalytic domain and excellent e-value (2e-16) with the casein kinase 1 from human. However this chimp kinase has a POLO BOX tethered to the kinase catalytic domain.

Thus this chimp kinase represents a hybrid CK1_POLO kinase. Interestingly ENSEMBL reports that ENSPTRP00000001150 has a high similarity with the human kinase ENSP00000361275. However, according to our classification protocol ENSP00000361275 is classified as a POLO kinase on the basis of 52% sequence identity with classical POLO kinases and excellent e-value of e-112. Figure 1 shows the dendrogram of the CK1 sub-family of kinases and it highlights the significant divergence of chimp homologue from its counterparts in other organisms

The first huge surprise is to find a kinase with so little sequence identity to its closest human counterpart. The DNA identity of human and chimp is routinely cited in the high 90 percent (how exactly you calculate it affects the final value) and they are our closest relatives. Finding a human-mouse ortholog identity as low as 31% would be stunning; for human-chimp it would be indescribably surprising. The second huge surprise is the claim of a hybrid Polo-CK1 kinase. The Polo box is a domain which recognizes phosphorylated peptides and is important in the activation & substrate recognition by Polo kinases. It is the signature of the Polo subfamily and has not been reported to be found on any other protein. The third surprise is in the dendrogram; it is claimed that this kinase has an affinity to CK1-type kinases, but in their rooted dendrogram (source of rooting not explained, a serious error) this kinase is an outgroup to all of the other presented kinases! Without some true outgroups (ideally representatives of other key families), how can we tell what it is most similar to?

Now, a strong criticism of mine of this paper is that it relies too much on Ensembl-derived sequences and annotation. Ensembl is a great system & I have high respect for it, but it is also trying to do the very complex job of integrating a lot of other data with genomic sequences of varying quality and we are not scientists if we fully trust it to always be correct. It is much better to have a more definitive reference point; why rely on someone's hand sketched map if you have a USGS topographic section available? And for a solid anchor database, it is hard to beat the RefSeq human protein dataset. So, we take the sequence from their figure for this ORF


>3|Chimp|ENSPTRP00000001150
SLAHIWKARHTLLEPEVRYYLRQILSGLKYLHQRGILHRDLKLGNFFITENMELKVGDF
GLAARLEPPEQRKKTICGTPNYVAPEVLLRQGHGPEADVWSLGCVMYTLLCGSPPFETA
DLKETYRCIKQVHYTLPASLSLPARQLLAAILRASPRDRPSIDQILRHDFFTKGYTPDR
LPISSCVTVPDLTPPNPARSLFAKVTKSLFGRKKKKSKNHAQESDEVSGLVSGLMRTSV
GHQDARPEAPAASGPAPVSLVETAPEDSSPRGTLASSGDGFEEGLTVATVVESALCALR
NCVAFMPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQAL
LMLFSDGTVQVNFYGDHTKLILSGWEPLLVTFVARNRSACTYLASHLRQLGCSPDLRQRLRYALRLLRDRSPA

and our top hit is


 GENE ID: 1263 PLK3 | polo-like kinase 3 (Drosophila) [Homo sapiens]
(Over 10 PubMed links)

 Score =  663 bits (1710),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 328/362 (90%), Positives = 335/362 (92%), Gaps = 14/362 (3%)

That resolves all these questions: it's a straightforward ortholog of PLK3 (which explains the Polo boxes), not some noteworthy hybrid and the sequence identity is 90+% -- and that score is dropped a lot by some iffy regions like this


Query  301  MPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSD  360
            MPPAEQNPAPLAQPEPLVWVSKWVDY                    +  + + + +LF+D
Sbjct  445  MPPAEQNPAPLAQPEPLVWVSKWVDYSNKFG-------------FGYQLSSRRVAVLFND  491

Query  361  GT  362
            GT
Sbjct  492  GT  493


 Score =  176 bits (446),  Expect = 1e-43, Method: Compositional matrix adjust.
 Identities = 83/91 (91%), Positives = 87/91 (95%), Gaps = 0/91 (0%)

Query  319  WVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG  378
            ++ + +  GGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG
Sbjct  538  YMEQHLMKGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG  597

Query  379  WEPLLVTFVARNRSACTYLASHLRQLGCSPD  409
            WEPLLVTFVARNRSACTYLASHLRQLGCSPD
Sbjct  598  WEPLLVTFVARNRSACTYLASHLRQLGCSPD  628

What's going on there? Well, most likely this is underlining the draft nature of the chimpanzee genome. I checked with TBLASTN, and there aren't ESTs around this region -- the chimp PLK3 is pretty much a pure gene prediction model -- a tough problem that has been tackled well but never perfectly. Plus, the underlying genomic data is, well, draft quality. Another TBLASTN search revealed that although this Ensembl prediction is from the middle of a large contig, the N-terminus of human PLK3 has a great match on another contig -- but from the same chromosome.
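To put a number on how much an iffy stretch like that drags the score down, here is a minimal Python sketch (not part of the original analysis) that computes percent identity over two rows of a pairwise alignment, optionally skipping gapped columns. The function name and the `skip_gap_columns` switch are my own inventions for illustration; the example rows are the Query 301-360 / Sbjct 445-491 block shown above.

```python
def percent_identity(query_row, sbjct_row, skip_gap_columns=False):
    """Percent identity between two rows of a pairwise alignment.

    Both rows must be the same length, with '-' marking a gap.
    If skip_gap_columns is True, columns with a gap in either row
    are left out of the denominator.
    """
    if len(query_row) != len(sbjct_row):
        raise ValueError("alignment rows must have equal length")
    matches = 0
    columns = 0
    for q, s in zip(query_row, sbjct_row):
        has_gap = q == "-" or s == "-"
        if has_gap and skip_gap_columns:
            continue
        columns += 1
        if not has_gap and q == s:
            matches += 1
    return 100.0 * matches / columns if columns else 0.0


# The troublesome block from the PLK3 alignment above.
query = "MPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSD"
sbjct = "MPPAEQNPAPLAQPEPLVWVSKWVDYSNKFG" + "-" * 13 + "FGYQLSSRRVAVLFND"

print(f"all columns:      {percent_identity(query, sbjct):.1f}%")
print(f"ungapped columns: {percent_identity(query, sbjct, True):.1f}%")
```

A window that scores far below the rest of an otherwise near-identical alignment is a hint to look at the gene model, not at the biology.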

Okay, maybe that's a fluke. So here's another chimp kinase highlighted in the text

A protein (ENSPTRP00000000076), classified under PKC subfamily, is composed of a PB1 domain followed by the protein kinase domain which is followed by a protein kinase C terminal domain (Figure 3a1). The PB1 domain is present in many eukaryotic cytoplasmic signalling proteins and is responsible, although not systematically, in the formation of PB1 dimers [25]. It thus serves as a molecular recognition module. This architecture is known so far only in an atypical PKC of Phallusia mammilata, a sea squirt. Our analysis identified two chimpanzee PKCs and a human PKC with a similar architecture, in which a phorbol esters/diacylglycerol binding domain is inserted between the PB1 and the protein kinase domain. The presence of the phorbol esters/diacylglycerol binding domain in combination with the protein kinase and a PKC terminal domain indicates that it is probably responsible for the recruitment of diacylglycerol, which in turns might be involved in activation of the kinase. The deletion of this domain in chimpanzee PKC (ENSPTRP00000000076) implies that the recruitment of diacylglycerol might be achieved by an external interacting module.

Again, the first thing to do is to search the ORF

>1|Chimp|ENSPTRP00000000076
MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPL
TLKWVDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDK
SIYRRGARRWRKLYCANGHLFQAKRFNRDSVMPSQEPPVDDKNEDADLPSEETDGI
AYISSSRKHDSIKDDSEDLKPVIDGMDGIKISQGLGLQDFDLIRVIGRGSYAKVLL
VRLKKNDQIYAMKVVKKELVHDDETTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHA
RFYAAEICIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTS
TFCGTPNYIAPEILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDY
LFQVILEKPIRIPRFLSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSI
DWDLLEKKQALPPFQPQITDDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEG
FEYINPLLLSTEESV

against human RefSeq to get our bearings.


 GENE ID: 5590 PRKCZ | protein kinase C, zeta [Homo sapiens]
(Over 100 PubMed links)

 Score = 1031 bits (2665),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 518/592 (87%), Positives = 518/592 (87%), Gaps = 73/592 (12%)

Query  1    MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW  60
            MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW
Sbjct  1    MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW  60

Query  61   VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR  120
            VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR
Sbjct  61   VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR  120

Query  121  RWRKLYCANGHLFQAKRFNR----------------------------------------  140
            RWRKLY ANGHLFQAKRFNR                                        
Sbjct  121  RWRKLYRANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRCHGLVPLTC  180

Query  141  ----DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG  196
                DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG
Sbjct  181  RKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG  240

Query  197  IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE--------  248
            IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE        
Sbjct  241  IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDEDIDWVQTE  300

Query  249  ---------------------TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI  287
                                 TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI
Sbjct  301  KHVFEQASSNPFLVGLHSCFQTTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI  360

Query  288  CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP  347
            CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP
Sbjct  361  CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP  420

Query  348  EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF  407
            EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF
Sbjct  421  EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF  480

Query  408  LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT  467
            LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT
Sbjct  481  LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT  540

Query  468  DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV  519
            DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV
Sbjct  541  DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV  592

Okay, so the human protein has been described previously: it is human protein kinase C zeta. Have the PB1 domain in PKCzeta and its implications been previously discussed? A quick PubMed search turned up two papers from earlier this decade (in Molecular Cell & JBC) which actually demonstrated the dimerization potential of the PKCzeta PB1 domain. So a PKC with a PB1 domain is not novel & noteworthy. What about the missing diacylglycerol-binding domain (that first big gap) in the chimp kinase? That could be interesting, so let's see whether we can find any EST evidence to support it. Alas, the only EST evidence refutes it and identifies the gap as spurious (and both of these ESTs were deposited in October 2007 and the paper submitted in March 2008, so they are not an unfair criticism)


>dbj|DC524857.1|  DC524857 chimpanzee brain cDNA library PflB Pan troglodytes verus 
cDNA clone PflB8010 5', mRNA sequence.
Length=404

 Score =  108 bits (270),  Expect(2) = 3e-31, Method: Composition-based stats.
 Identities = 57/102 (55%), Positives = 58/102 (56%), Gaps = 44/102 (43%)
 Frame = +2

Query  112  KSIYRRGARRWRKLYCANGHLFQAKRFNR-------------------------------  140
            +SIYRRGARRWRKLYCANGHLFQAKRFNR                               
Sbjct  38   ESIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKR  217

Query  141  -------------DSVMPSQEPPVDDKNEDADLPSEETDGIA  169
                         DSVMPSQEPPVDDKNEDADLPSEETDGIA
Sbjct  218  CHGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIA  343


 Score = 42.7 bits (99),  Expect(2) = 3e-31, Method: Compositional matrix adjust.
 Identities = 20/22 (90%), Positives = 21/22 (95%), Gaps = 0/22 (0%)
 Frame = +3

Query  168  IAYISSSRKHDSIKDDSEDLKP  189
            + YISSSRKHDSIKDDSEDLKP
Sbjct  339  LLYISSSRKHDSIKDDSEDLKP  404


>dbj|DC519886.1|  DC519886 chimpanzee brain cDNA library PccB Pan troglodytes verus 
cDNA clone PccB0482 5', mRNA sequence.
Length=612

 Score =  114 bits (284),  Expect = 3e-26, Method: Compositional matrix adjust.
 Identities = 63/107 (58%), Positives = 63/107 (58%), Gaps = 44/107 (41%)
 Frame = +2

Query  113  SIYRRGARRWRKLYCANGHLFQAKRFNR--------------------------------  140
            SIYRRGARRWRKLYCANGHLFQAKRFNR                                
Sbjct  290  SIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRC  469

Query  141  ------------DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR  175
                        DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR
Sbjct  470  HGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR  610
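Long gaps in the predicted protein relative to a well-characterized ortholog are exactly the sort of thing worth flagging automatically before building a story on a "deleted" domain. Below is a rough Python sketch (not something used for this post) that lists long gap runs in the query row of a pairwise alignment along with the subject residues opposite them. The function name and the `min_len` cutoff of 10 are arbitrary assumptions; the example rows are rebuilt from the PRKCZ alignment above.

```python
import re


def query_gap_runs(query_row, sbjct_row, min_len=10):
    """Find long runs of '-' in the query row of a pairwise alignment.

    Returns a list of (start_column, run_length, subject_residues),
    where start_column is a 0-based alignment column and
    subject_residues is what the subject has opposite the gap.
    """
    if len(query_row) != len(sbjct_row):
        raise ValueError("alignment rows must have equal length")
    runs = []
    for match in re.finditer(r"-+", query_row):
        length = match.end() - match.start()
        if length >= min_len:
            missing = sbjct_row[match.start():match.end()]
            runs.append((match.start(), length, missing))
    return runs


# Segment of human PRKCZ with no counterpart in the chimp prediction.
missing = "RAYCGQCSERIWGLARQGYRCINCKLLVHKRCHGLVPLTCRKHM"
query = "RWRKLYCANGHLFQAKRFNR" + "-" * len(missing) + "DSVMPSQEPP"
sbjct = "RWRKLYRANGHLFQAKRFNR" + missing + "DSVMPSQEPP"

for start, length, residues in query_gap_runs(query, sbjct):
    print(f"gap at column {start}, {length} residues missing: {residues}")
```

Each reported run is a candidate for the EST/cDNA check shown above: if transcripts cover the "missing" residues, the gap belongs to the gene model rather than to the chimp.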

I checked in detail one more note (about the chimp protein ENSPTRP00000001185 and its human ortholog) about a domain architecture claimed to be unique to human & chimp due to a missing domain. Again, the RefSeq protein search revealed that the chimp protein is nearly identical to a known human kinase (MARK2) albeit greatly truncated -- and the missing domain is beyond the truncation point.

I haven't checked every kinase in the paper, but seeing the same classes of mistakes repeatedly doesn't give much hope. Comparing the human & chimp kinomes (or any other well-defined subset of genes) is a worthwhile enterprise -- so long as it is kept in mind that the chimp genome is a very rough draft and all appropriate computational controls are used. This paper, unfortunately, shows no awareness of either of these principles.

What irks me most about this sort of paper is that it gives all of us a bit of a black eye. Someone who saw the abstract & got excited would be in for a big letdown. It's hard enough to earn the respect of bench biologists without it being tossed away with poorly done analyses.

So what are the positive lessons to be learned? Here are a few tips

Always try to find meaningful biological names for your sequences. Use them in your figures & search them in the literature like a bloodhound.

Always check genomic predictions against EST & cDNA databases.

Always try to root your phylogenetic trees, unless you have a really good reason not to do so. And, if your tree is rooted, you must explain how you rooted it.

If your results sound amazing, take a deep breath & think of several tests that could debunk them. Then do those tests. If they survive, go to bed & think of another batch of tests.
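As one way of making the sanity checks above routine, here is a minimal Python sketch that reads BLAST tabular output (the standard 12-column layout: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), keeps the top-scoring hit per query, and lists predictions whose best human hit falls below an identity threshold. The 80% cutoff, the function names, and the sample rows with placeholder identifiers are all made up for illustration; they are not results from the paper or from this post.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query subject pident length evalue bitscore")


def best_hits(tabular_lines):
    """Return {query_id: Hit} keeping the highest-bitscore hit per query."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def needs_a_second_look(best, min_pident=80.0):
    """Queries whose best hit is below min_pident (an assumed cutoff)."""
    return sorted(q for q, h in best.items() if h.pident < min_pident)


# Toy rows with placeholder IDs, only to show the expected input shape.
rows = [
    "chimp_pred_1\thuman_A\t98.5\t400\t6\t0\t1\t400\t1\t400\t0.0\t800",
    "chimp_pred_1\thuman_B\t45.0\t380\t200\t5\t1\t380\t10\t390\t1e-90\t300",
    "chimp_pred_2\thuman_C\t31.0\t250\t170\t8\t1\t250\t5\t260\t2e-16\t85",
]

top = best_hits(rows)
for query_id in needs_a_second_look(top):
    print(query_id, "-> best hit", top[query_id].subject, top[query_id].pident)
```

For a human-chimp comparison, anything this flags is more likely a bad gene model or the wrong reference set than a biological surprise, so it is where to start with the EST & cDNA checks.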

P.S. One way to put reviewers in a bad mood is to not supply your sequences. The supplementary materials for this paper do not have all of the ORFs; I pulled some out from their alignments with a custom script. Elsewhere via Google I found a collection linked to the work -- but with the whole predicted chimp proteome in it! Very unwieldy & slow to download!
P.P.S. For anyone interested in exploring further, here are the other sequences from the alignment in additional file 3. The number in the header was added by my script to indicate which alignment within that file the sequence was taken from.


>2|Chimp|ENSPTRP00000019171
MSAEVRLRRLQQLVLDPGFLGLEPLLDLLLGVHQELGASELAQDKYVADFLQWAEPIVVRL
KEVRLQRDDFEILKVIGRGAFSEVAVVKMKQTGQVYAMKIMNKWDMLKRGEVSCFREERDV
LVNGDRRWITQLHFAFQDENYLYLVMEYYVGGDLLTLLSKFGERIPAEMARFYLAEIVMAI
DSVHRLGYVHRDIKPDNILLDRCGHIRLADFGSCLKLRADGTVRSLVAVGTPDYLSPEILQ
AVGGGPGTGSYGPECDWWALGVFAYEMFYGQTPFYADSTAETYGKIVHYKEHLSLPLVDEG
VPEEARDFIQRLLCPPETRLGRGGAGDFRTHPFFFGLDWDGLRDSVPPFTPDFEGATDTCN
FDLVEDGLTAMVSGGGETLSDIREGAPLGVHLPFVGYSYSCMALRDSEVPGPTPMELEAEQ
LLEPHVQAPSLEPSVSPQDETAEVAVPAAVPAAEAEAEVTLRELQEALEEEVLTRQSLSRE
MEAIRTDNQNFASQLREAEARNRDLEAHVRQLQERMELLQAEGATAVTGVPSPRATDPPSH
VPWPGLSXALSLLLFAVVLSRAAALGCLGLVAPAGXLXAVWRRPGAARAPX
>4|Chimp|ENSPTRP00000011569
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQ
LMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRS
GSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKK
EVIVAKDEVAHTLTENRVLQNSRHPFLTALKYSFQTHDRLCFVMEYANGGELFFHLSRERV
FSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENLMLDKDGHIKITDFGLCKEGIKDGAT
MKTFCGTSEYLAPRLSPPFKPQVTSETDTRYFDEEFTAQMITITPP
DQDDSMECVDSERRPHFPQFSYSASGTA
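For anyone who wants to do the same kind of extraction, here is a minimal Python sketch of pulling ungapped sequences out of alignment rows and writing them as FASTA with the number|species|ID header style used above. This is not the actual script used for this post, and the input layout (a dict keyed by species and protein ID) is an assumption, since the format of the paper's alignment files isn't described here.

```python
def alignment_to_fasta(alignment_index, rows, width=60):
    """Convert gapped alignment rows to FASTA text.

    alignment_index: number of the alignment within the source file,
                     written as the first field of each header.
    rows: dict mapping (species, protein_id) -> gapped sequence,
          with '-' or '.' as gap characters (an assumed input layout).
    """
    records = []
    for (species, protein_id), gapped in rows.items():
        seq = gapped.replace("-", "").replace(".", "")
        # Wrap the sequence at a fixed width for readability.
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        header = f">{alignment_index}|{species}|{protein_id}"
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"


# Toy example with a placeholder ID and a made-up gapped row.
example = {("Chimp", "EXAMPLE_ID"): "MSAEV--RLRRL.QQLV"}
print(alignment_to_fasta(2, example), end="")
```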

7 comments:

Jonathan BadgerTuesday, February 03, 2009 11:04:00 AM
You wrote: Always try to root your phylogenetic trees, unless you have a really good reason to do so. And, if your tree is rooted, you must explain how you rooted it

Are you trying to say:
1) Always try to root your phylogenetic trees, unless you have a really good reason not to do so.

or

2) Never root your phylogenetic trees, unless you have a really good reason to do so.

There are arguments for rooting and not rooting, so I'm not just being picky about grammar; I really don't know which side you are supporting.
Keith RobisonTuesday, February 03, 2009 1:07:00 PM
DUH!!! Thanks for catching that missing "NOT". I probably would, on thinking about it, write something different, such as "If you haven't explicitly rooted your tree, then don't treat it as a rooted tree".
AnonymousWednesday, February 04, 2009 12:35:00 PM
Good catch, Keith. Unfortunately everyone is now an expert and uses published information rather than revalidating what are in actuality tentative results.

I do a lot of reanalysis of genomics projects as a consultant. Guess it's true that there's always time (and money) to do something twice, but never time to do it once properly.

- Brian Moldover (and hi!)
AnonymousWednesday, February 04, 2009 1:12:00 PM
This is hilarious. I remember slaughtering an earlier version of this manuscript when it was in review at a different journal. I guess you can get anything published if you just shop it around long enough...
AnonymousWednesday, February 04, 2009 2:56:00 PM
Can I use your example to craft a bioinformatics laboratory for a class?
Keith RobisonWednesday, February 04, 2009 5:04:00 PM
I meant to plant the idea that this paper could be a useful starting point for a bioinformatics course exercise. My grad school professors loved to require us to read seriously flawed papers (in one case, a complete fraud) to see what we would pick up on (none of us figured out the fraudulent paper was such, a good exercise in instilling humility!)
Chris CotsapasSunday, February 22, 2009 1:21:00 PM
@AnonymousReviewer - one of my great frustrations with peer review is spending a non-trivial amount of time constructively criticising a manuscript, only to see it resurface in all its flawed glory at some other journal. Gah!

Monday, February 02, 2009
