As an aside, I find papers I review to fall into three categories. The first are very solid papers that I can find little to comment on; I might make a suggestion or two (often about data visualization), but if the core is solid there isn't much for the reviewer to do. The second category is the most frustrating: when I feel the paper is on the edges of my expertise & I start to question whether I should have agreed to review it (which is done after seeing an abstract). The third category is the one I can really dig into: seriously flawed papers. I think one of my reviews of a paper was approaching the length of the manuscript; the paper was badly flawed but there was a thread of substance that with a lot of work could be turned into something decent.
Anyway, I noticed this paper in the BioMedCentral Table of Contents extract which is emailed to me weekly.
Comparative kinomics of human and chimpanzee reveals unique kinship and functional diversity generated by new domain combinations. Now, back at MLNM I had for a while specialized in protein kinases, so it is a field of some interest. I hadn't kept up with the status of the chimpanzee genome sequencing, but there is a longstanding familial interest in this species so that was another angle of interest.
Sometimes when there has been some accident, a review of the circumstances leading up to it will reveal many opportunities for recognizing that a bad situation had been set up: the engineer ignored a stop signal or the dispatcher should have noticed the switch was set incorrectly. This paper, particularly one of its centerpiece findings, has that feel to it: there were many warning flags that something was amiss, but unfortunately the authors and the reviewers failed to see them.
When I first planned this critique, I was going to detail several examples. However, that would seem to lead to a very long post, so I will pick a few examples and claim that they are representative. If anyone wishes to challenge that claim, then I'll flesh out some more. Also, I feel the first example is particularly apropos because it is a bit of a centerpiece; it gets a lot of space (including a special figure) in the text.
It was this bit of text that caused me to raise my eyebrows as far as they could go (I wish I could do the Spock single-eyebrow raise, but I can't). The bolding is mine to emphasize the big surprises.
For example, a chimpanzee kinase classified as casein kinase 1 (ENSPTRP00000001150) on the basis of significant sequence similarity (31%) of the catalytic domain and excellent e-value (2e-16) with the casein kinase 1 from human. However this chimp kinase has a POLO BOX tethered to the kinase catalytic domain.
Thus this chimp kinase represents a hybrid CK1_POLO kinase. Interestingly ENSEMBL reports that ENSPTRP00000001150 has a high similarity with the human kinase ENSP00000361275. However, according to our classification protocol ENSP00000361275 is classified as a POLO kinase on the basis of 52% sequence identity with classical POLO kinases and excellent e-value of e-112. Figure 1 shows the dendrogram of the CK1 sub-family of kinases and it highlights the significant divergence of chimp homologue from its counterparts in other organisms
The first huge surprise is to find a kinase with so little sequence identity to its closest human counterpart. The DNA identity of human and chimp is routinely cited in the high 90 percent range (how exactly you calculate it affects the final value) and they are our closest relatives. Finding a human-mouse ortholog identity of less than 31% would be stunning; for human-chimp it would be indescribably surprising. The second huge surprise is the claim of a hybrid Polo-CK1 kinase. The Polo box is a domain which recognizes phosphorylated peptides and is important in the activation & substrate recognition by Polo kinases. It is the signature of the Polo subfamily and has not been reported to be found on any other protein. The third surprise is in the dendrogram; it is claimed that this kinase has an affinity to CK1-type kinases, but in their rooted dendrogram (source of rooting not explained, a serious error) this kinase is an outgroup to all of the other presented kinases! Without some true outgroups (ideally representatives of other key families), how can we tell what it is most similar to?
Now, a strong criticism of mine of this paper is that it relies too much on Ensembl-derived sequences and annotation. Ensembl is a great system & I have high respect for it, but it is also trying to do the very complex job of integrating a lot of other data with genomic sequences of varying quality and we are not scientists if we fully trust it to always be correct. It is much better to have a more definitive reference point; why rely on someone's hand sketched map if you have a USGS topographic section available? And for a solid anchor database, it is hard to beat the RefSeq human protein dataset. So, we take the sequence from their figure for this ORF
and our top hit is
GENE ID: 1263 PLK3 | polo-like kinase 3 (Drosophila) [Homo sapiens]
(Over 10 PubMed links)
Score = 663 bits (1710), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 328/362 (90%), Positives = 335/362 (92%), Gaps = 14/362 (3%)
That resolves all these questions: it's a straightforward ortholog of PLK3 (which explains the Polo boxes), not some noteworthy hybrid and the sequence identity is 90+% -- and that score is dropped a lot by some iffy regions like this
Query 301 MPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSD 360
MPPAEQNPAPLAQPEPLVWVSKWVDY + + + + +LF+D
Sbjct 445 MPPAEQNPAPLAQPEPLVWVSKWVDYSNKFG-------------FGYQLSSRRVAVLFND 491
Query 361 GT 362
Sbjct 492 GT 493
Score = 176 bits (446), Expect = 1e-43, Method: Compositional matrix adjust.
Identities = 83/91 (91%), Positives = 87/91 (95%), Gaps = 0/91 (0%)
Query 319 WVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG 378
++ + + GGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG
Sbjct 538 YMEQHLMKGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG 597
Query 379 WEPLLVTFVARNRSACTYLASHLRQLGCSPD 409
Sbjct 598 WEPLLVTFVARNRSACTYLASHLRQLGCSPD 628
What's going on there? Well, most likely this is underlining the draft nature of the chimpanzee genome. I checked with TBLASTN, and there aren't ESTs around this region -- the chimp PLK3 is pretty much a pure gene prediction model -- a tough problem that has been tackled well but never perfectly. Plus, the underlying genomic data is, well, draft quality. Another TBLASTN search revealed that although this Ensembl prediction is from the middle of a large contig, the N-terminus of human PLK3 has a great match on another contig -- but from the same chromosome.
Okay, maybe that's a fluke. So here's another chimp kinase highlighted in the text
A protein (ENSPTRP00000000076), classified under PKC subfamily, is composed of a PB1 domain followed by the protein kinase domain which is followed by a protein kinase C terminal domain (Figure 3a1). The PB1 domain is present in many eukaryotic cytoplasmic signalling proteins and is responsible, although not systematically, in the formation of PB1 dimers . It thus serves as a molecular recognition module. This architecture is known so far only in an atypical PKC of Phallusia mammilata, a sea squirt. Our analysis identified two chimpanzee PKCs and a human PKC with a similar architecture, in which a phorbol esters/diacylglycerol binding domain is inserted between the PB1 and the protein kinase domain. The presence of the phorbol esters/diacylglycerol binding domain in combination with the protein kinase and a PKC terminal domain indicates that it is probably responsible for the recruitment of diacylglycerol, which in turns might be involved in activation of the kinase. The deletion of this domain in chimpanzee PKC (ENSPTRP00000000076) implies that the recruitment of diacylglycerol might be achieved by an external interacting module.
Again, the first thing to do is to search the ORF
against human RefSeq to get our bearings.
GENE ID: 5590 PRKCZ | protein kinase C, zeta [Homo sapiens]
(Over 100 PubMed links)
Score = 1031 bits (2665), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 518/592 (87%), Positives = 518/592 (87%), Gaps = 73/592 (12%)
Query 1 MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW 60
Sbjct 1 MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW 60
Query 61 VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR 120
Sbjct 61 VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR 120
Query 121 RWRKLYCANGHLFQAKRFNR---------------------------------------- 140
Sbjct 121 RWRKLYRANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRCHGLVPLTC 180
Query 141 ----DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG 196
Sbjct 181 RKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG 240
Query 197 IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE-------- 248
Sbjct 241 IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDEDIDWVQTE 300
Query 249 ---------------------TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI 287
Sbjct 301 KHVFEQASSNPFLVGLHSCFQTTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI 360
Query 288 CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP 347
Sbjct 361 CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP 420
Query 348 EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF 407
Sbjct 421 EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF 480
Query 408 LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT 467
Sbjct 481 LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT 540
Query 468 DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV 519
Sbjct 541 DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV 592
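As an aside on how much the gaps drive that headline number: the 87% identity above counts the 73 gap columns in the denominator. Here is a minimal Python sketch of the arithmetic, using one row pair copied from the alignment above. The function name `percent_identity` and the `include_gap_columns` switch are my own illustrative choices, not anything BLAST provides.

```python
def percent_identity(query: str, subject: str, include_gap_columns: bool = True) -> float:
    """Percent identity between two equal-length rows of a pairwise alignment.

    If include_gap_columns is False, columns where either row has '-' are
    left out of the denominator (hypothetical switch, for illustration only).
    """
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    columns = 0
    for q, s in zip(query, subject):
        is_gap = q == "-" or s == "-"
        if is_gap and not include_gap_columns:
            continue
        columns += 1
        if not is_gap and q == s:
            identical += 1
    if columns == 0:
        raise ValueError("no columns left to compare")
    return 100.0 * identical / columns


# One row pair from the PRKCZ alignment (Query 121-140 vs Sbjct 121-180).
query_row = "RWRKLYCANGHLFQAKRFNR" + "-" * 40
subject_row = "RWRKLYRANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRCHGLVPLTC"

with_gaps = percent_identity(query_row, subject_row, include_gap_columns=True)
without_gaps = percent_identity(query_row, subject_row, include_gap_columns=False)
print(f"this row, gaps counted:  {with_gaps:.1f}%")
print(f"this row, gaps excluded: {without_gaps:.1f}%")

# The same arithmetic on the totals BLAST reported for the whole hit:
# 518 identities over 592 columns, 73 of which are gaps.
reported = 100.0 * 518 / 592
ungapped = 100.0 * 518 / (592 - 73)
print(f"whole hit, gaps counted:  {reported:.1f}%")
print(f"whole hit, gaps excluded: {ungapped:.1f}%")
```

Outside the gaps, the chimp prediction and human PKC zeta differ at a single residue in this alignment, which is why the gaps themselves deserve the scrutiny.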
Okay, so the human protein has been described previously: it is human protein kinase C zeta. Have the PB1 domain in PKCzeta and its implications been previously discussed? A quick PubMed search turned up two papers from earlier this decade (in Molecular Cell & JBC) which actually demonstrated the dimerization potential of the PKCzeta PB1 domain. So a PKC with a PB1 domain is not novel & noteworthy. What about the missing diacylglycerol-binding domain (that first big gap) in the chimp kinase? That could be interesting, so let's see whether we can find any EST evidence to support it. Alas, the only EST evidence refutes it and identifies the gap as spurious (and both of these ESTs were deposited in October 2007 and the paper submitted in March 2008, so they are not an unfair criticism)
>dbj|DC524857.1| DC524857 chimpanzee brain cDNA library PflB Pan troglodytes verus
cDNA clone PflB8010 5', mRNA sequence.
Score = 108 bits (270), Expect(2) = 3e-31, Method: Composition-based stats.
Identities = 57/102 (55%), Positives = 58/102 (56%), Gaps = 44/102 (43%)
Frame = +2
Query 112 KSIYRRGARRWRKLYCANGHLFQAKRFNR------------------------------- 140
Sbjct 38 ESIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKR 217
Query 141 -------------DSVMPSQEPPVDDKNEDADLPSEETDGIA 169
Sbjct 218 CHGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIA 343
Score = 42.7 bits (99), Expect(2) = 3e-31, Method: Compositional matrix adjust.
Identities = 20/22 (90%), Positives = 21/22 (95%), Gaps = 0/22 (0%)
Frame = +3
Query 168 IAYISSSRKHDSIKDDSEDLKP 189
Sbjct 339 LLYISSSRKHDSIKDDSEDLKP 404
>dbj|DC519886.1| DC519886 chimpanzee brain cDNA library PccB Pan troglodytes verus
cDNA clone PccB0482 5', mRNA sequence.
Score = 114 bits (284), Expect = 3e-26, Method: Compositional matrix adjust.
Identities = 63/107 (58%), Positives = 63/107 (58%), Gaps = 44/107 (41%)
Frame = +2
Query 113 SIYRRGARRWRKLYCANGHLFQAKRFNR-------------------------------- 140
Sbjct 290 SIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRC 469
Query 141 ------------DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR 175
Sbjct 470 HGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR 610
I checked in detail one more note (about the chimp protein ENSPTRP00000001185 and its human ortholog) about a domain architecture claimed to be unique to human & chimp due to a missing domain. Again, the RefSeq protein search revealed that the chimp protein is nearly identical to a known human kinase (MARK2) albeit greatly truncated -- and the missing domain is beyond the truncation point.
I haven't checked every kinase in the paper, but seeing the same classes of mistakes repeatedly doesn't give much hope. Comparing the human & chimp kinomes (or any other well-defined subset of genes) is a worthwhile enterprise -- so long as it is kept in mind that the chimp genome is a very rough draft and all appropriate computational controls are used. This paper, unfortunately, shows no awareness of either of these principles.
What irks me most about this sort of paper is that it gives all of us a bit of a black eye. Someone who saw the abstract & got excited would be in for a big letdown. It's hard enough to earn the respect of bench biologists without it being tossed away with poorly done analyses.
So what are the positive lessons to be learned? Here are a few tips
- Always try to find meaningful biological names for your sequences. Use them in your figures & search them in the literature like a bloodhound.
- Always check genomic predictions against EST & cDNA databases.
- Always try to root your phylogenetic trees, unless you have a really good reason not to do so. And, if your tree is rooted, you must explain how you rooted it.
- If your results sound amazing, take a deep breath & think of several tests that could debunk them. Then do those tests. If they survive, go to bed & think of another batch of tests.
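One cheap debunking test along these lines is to flag long gap runs in the predicted protein when it is aligned to a well-characterized ortholog, since those are exactly the regions worth checking against ESTs. Below is a rough Python sketch; the function name `long_gap_runs` and the 10-column threshold are arbitrary assumptions of mine, not an established rule.

```python
import re


def long_gap_runs(aligned_query: str, min_length: int = 10):
    """Return (start_column, run_length) for each run of '-' in an aligned
    query row that is at least min_length columns long.

    The default threshold is an arbitrary illustrative choice.
    """
    pattern = re.compile(r"-{%d,}" % min_length)
    return [(m.start(), len(m.group())) for m in pattern.finditer(aligned_query)]


# Part of the chimp PKC query row from the RefSeq alignment above, with the
# first big gap (40 + 4 columns across two alignment rows) joined together.
query_row = "RWRKLYCANGHLFQAKRFNR" + "-" * 44 + "DSVMPSQEPP"

for start, length in long_gap_runs(query_row):
    print(f"gap of {length} columns starting at alignment column {start}")
```

A flagged run is not proof of an error, only a pointer to where the gene model should be compared with EST or cDNA evidence before any biological conclusion is drawn.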
P.S. One way to put reviewers in a bad mood is to not supply your sequences. The supplementary materials for this paper do not have all of the ORFs; I pulled some out from their alignments with a custom script. Elsewhere via Google I found a collection linked to the work -- but with the whole predicted chimp proteome in it! Very unwieldy & slow to download!
P.P.S. For anyone interested in exploring further, here are the other sequences from the alignment in additional file 3. The number in the header was added by my script to indicate which alignment within that file the sequence was taken from.