Monday, February 02, 2009

A Fatally Flawed Paper

I like to review manuscripts but don't do so very often. When I started this blog I thought I might often use it to play "If I had been the reviewer", but I haven't done that much. However, a paper came to my attention that I can't stop thinking about until I tackle it here.

As an aside, I find papers I review to fall into three categories. The first are very solid papers that I can find little to comment on; I might make a suggestion or two (often about data visualization), but if the core is solid there isn't much for the reviewer to do. The second category is the most frustrating: when I feel the paper is on the edges of my expertise & I start to question whether I should have agreed to review it (which is done after seeing an abstract). The third category is the one I can really dig into: seriously flawed papers. I think one of my reviews of a paper was approaching the length of the manuscript; the paper was badly flawed but there was a thread of substance that with a lot of work could be turned into something decent.

Anyway, I noticed this paper in the BioMedCentral Table of Contents extract which is emailed to me weekly:
Comparative kinomics of human and chimpanzee reveals unique kinship and functional diversity generated by new domain combinations.
Now, back at MLNM I had for a while specialized in protein kinases, so it is a field of some interest. I hadn't kept up with the status of the chimpanzee genome sequencing, but there is a longstanding familial interest in this species so that was another angle of interest.

Sometimes when there has been some accident, a review of the circumstances leading up to it will reveal many opportunities for recognizing that a bad situation had been set up: the engineer ignored a stop signal or the dispatcher should have noticed the switch was set incorrectly. This paper, particularly one of its centerpiece findings, has that feel to it: there were many warning flags that something was amiss, but unfortunately the authors and the reviewers failed to see them.

When I first planned this critique, I was going to detail several examples. However, that would seem to lead to a very long post, so I will pick a few examples and claim that they are representative. If anyone wishes to challenge that claim, then I'll flesh out some more. Also, I feel the first example is particularly apropos because it is a bit of a centerpiece; it gets a lot of space (including a special figure) in the text.

It was this bit of text that caused me to raise my eyebrows as far as they could go (I wish I could do the Spock single-eyebrow raise, but I can't). The bolding is mine to emphasize the big surprises.
For example, a chimpanzee kinase classified as casein kinase 1 (ENSPTRP00000001150) on the basis of significant sequence similarity (31%) of the catalytic domain and excellent e-value (2e-16) with the casein kinase 1 from human. However this chimp kinase has a POLO BOX tethered to the kinase catalytic domain.

Thus this chimp kinase represents a hybrid CK1_POLO kinase. Interestingly ENSEMBL reports that ENSPTRP00000001150 has a high similarity with the human kinase ENSP00000361275. However, according to our classification protocol ENSP00000361275 is classified as a POLO kinase on the basis of 52% sequence identity with classical POLO kinases and excellent e-value of e-112. Figure 1 shows the dendrogram of the CK1 sub-family of kinases and it highlights the significant divergence of chimp homologue from its counterparts in other organisms


The first huge surprise is to find a kinase with so little sequence identity to its closest human counterpart. The DNA identity of human and chimp is routinely cited in the high 90 percent range (how exactly you calculate it affects the final value) and they are our closest relatives. Finding a human-mouse ortholog identity of less than 31% would be stunning; for human-chimp it would be indescribably surprising. The second huge surprise is the claim of a hybrid Polo-CK1 kinase. The Polo box is a domain which recognizes phosphorylated peptides and is important in the activation & substrate recognition by Polo kinases. It is the signature of the Polo subfamily and has not been reported to be found on any other protein. The third surprise is in the dendrogram; it is claimed that this kinase has an affinity to CK1-type kinases, but in their rooted dendrogram (source of rooting not explained, a serious error) this kinase is an outgroup to all of the other presented kinases! Without some true outgroups (ideally representatives of other key families), how can we tell what it is most similar to?
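
The first warning flag lends itself to an automated sanity check. The sketch below is only an illustration of the reasoning, not anything the paper's authors did: the function name and the 80% floor are my own assumptions, chosen loosely and not taken from any published cutoff.

```python
# Assumption: 80% is an illustrative floor for human-chimp protein identity,
# not a published threshold. Tune it to your own data.
HUMAN_CHIMP_IDENTITY_FLOOR = 80.0


def looks_like_artifact(identity_pct: float,
                        floor_pct: float = HUMAN_CHIMP_IDENTITY_FLOOR) -> bool:
    """Return True when a claimed human-chimp ortholog identity is implausibly low."""
    return identity_pct < floor_pct


# The paper reports 31% identity for ENSPTRP00000001150 against human casein kinase 1.
claimed_identity = 31.0
if looks_like_artifact(claimed_identity):
    print(f"{claimed_identity:.0f}% identity: recheck the gene model and the ortholog call")
```

A flag from a check like this proves nothing by itself; it just says the sequence deserves a second search before anything gets written about it.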

Now, a strong criticism of mine of this paper is that it relies too much on Ensembl-derived sequences and annotation. Ensembl is a great system & I have high respect for it, but it is also trying to do the very complex job of integrating a lot of other data with genomic sequences of varying quality and we are not scientists if we fully trust it to always be correct. It is much better to have a more definitive reference point; why rely on someone's hand-sketched map if you have a USGS topographic section available? And for a solid anchor database, it is hard to beat the RefSeq human protein dataset. So, we take the sequence from their figure for this ORF

>3|Chimp|ENSPTRP00000001150
SLAHIWKARHTLLEPEVRYYLRQILSGLKYLHQRGILHRDLKLGNFFITENMELKVGDF
GLAARLEPPEQRKKTICGTPNYVAPEVLLRQGHGPEADVWSLGCVMYTLLCGSPPFETA
DLKETYRCIKQVHYTLPASLSLPARQLLAAILRASPRDRPSIDQILRHDFFTKGYTPDR
LPISSCVTVPDLTPPNPARSLFAKVTKSLFGRKKKKSKNHAQESDEVSGLVSGLMRTSV
GHQDARPEAPAASGPAPVSLVETAPEDSSPRGTLASSGDGFEEGLTVATVVESALCALR
NCVAFMPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQAL
LMLFSDGTVQVNFYGDHTKLILSGWEPLLVTFVARNRSACTYLASHLRQLGCSPDLRQRLRYALRLLRDRSPA

and our top hit is

GENE ID: 1263 PLK3 | polo-like kinase 3 (Drosophila) [Homo sapiens]
(Over 10 PubMed links)

Score = 663 bits (1710), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 328/362 (90%), Positives = 335/362 (92%), Gaps = 14/362 (3%)

That resolves all these questions: it's a straightforward ortholog of PLK3 (which explains the Polo boxes), not some noteworthy hybrid and the sequence identity is 90+% -- and that score is dropped a lot by some iffy regions like this

Query 301 MPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSD 360
MPPAEQNPAPLAQPEPLVWVSKWVDY + + + + +LF+D
Sbjct 445 MPPAEQNPAPLAQPEPLVWVSKWVDYSNKFG-------------FGYQLSSRRVAVLFND 491

Query 361 GT 362
GT
Sbjct 492 GT 493


Score = 176 bits (446), Expect = 1e-43, Method: Compositional matrix adjust.
Identities = 83/91 (91%), Positives = 87/91 (95%), Gaps = 0/91 (0%)

Query 319 WVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG 378
++ + + GGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG
Sbjct 538 YMEQHLMKGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG 597

Query 379 WEPLLVTFVARNRSACTYLASHLRQLGCSPD 409
WEPLLVTFVARNRSACTYLASHLRQLGCSPD
Sbjct 598 WEPLLVTFVARNRSACTYLASHLRQLGCSPD 628

What's going on there? Well, most likely this is underlining the draft nature of the chimpanzee genome. I checked with TBLASTN, and there aren't ESTs around this region -- the chimp PLK3 is pretty much a pure gene prediction model -- a tough problem that has been tackled well but never perfectly. Plus, the underlying genomic data is, well, draft quality. Another TBLASTN search revealed that although this Ensembl prediction is from the middle of a large contig, the N-terminus of human PLK3 has a great match on another contig -- but from the same chromosome.
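
To see how much an iffy stretch like the one in the first alignment drags things down, the identities and gaps can be tallied directly from the aligned rows. This is a minimal sketch: the function name is mine, the two rows are the Query 301-360 and Sbjct 445-491 lines copied from the BLAST output above, and it counts only exact matches (BLAST's "Positives" would need a substitution matrix).

```python
def alignment_stats(query_row: str, sbjct_row: str) -> dict:
    """Count identical columns and gap columns in one pairwise alignment block."""
    if len(query_row) != len(sbjct_row) or not query_row:
        raise ValueError("aligned rows must be non-empty and of equal length")
    identities = 0
    gaps = 0
    for q, s in zip(query_row, sbjct_row):
        if q == "-" or s == "-":
            gaps += 1          # a gap in either row
        elif q == s:
            identities += 1    # exact residue match
    columns = len(query_row)
    return {
        "columns": columns,
        "identities": identities,
        "gaps": gaps,
        "identity_pct": 100.0 * identities / columns,
    }


# Rows copied from the Query 301-360 / Sbjct 445-491 block of the PLK3 hit.
query_row = "MPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSD"
sbjct_row = "MPPAEQNPAPLAQPEPLVWVSKWVDYSNKFG" + "-" * 13 + "FGYQLSSRRVAVLFND"

print(alignment_stats(query_row, sbjct_row))
```

Running the same tally over each block of a hit shows quickly whether the mismatches are spread evenly (real divergence) or piled into one short stretch (more likely a gene-model problem).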

Okay, maybe that's a fluke. So here's another chimp kinase highlighted in the text

A protein (ENSPTRP00000000076), classified under PKC subfamily, is composed of a PB1 domain followed by the protein kinase domain which is followed by a protein kinase C terminal domain (Figure 3a1). The PB1 domain is present in many eukaryotic cytoplasmic signalling proteins and is responsible, although not systematically, in the formation of PB1 dimers [25]. It thus serves as a molecular recognition module. This architecture is known so far only in an atypical PKC of Phallusia mammilata, a sea squirt. Our analysis identified two chimpanzee PKCs and a human PKC with a similar architecture, in which a phorbol esters/diacylglycerol binding domain is inserted between the PB1 and the protein kinase domain. The presence of the phorbol esters/diacylglycerol binding domain in combination with the protein kinase and a PKC terminal domain indicates that it is probably responsible for the recruitment of diacylglycerol, which in turns might be involved in activation of the kinase. The deletion of this domain in chimpanzee PKC (ENSPTRP00000000076) implies that the recruitment of diacylglycerol might be achieved by an external interacting module.


Again, the first thing to do is to search the ORF

>1|Chimp|ENSPTRP00000000076
MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPL
TLKWVDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDK
SIYRRGARRWRKLYCANGHLFQAKRFNRDSVMPSQEPPVDDKNEDADLPSEETDGI
AYISSSRKHDSIKDDSEDLKPVIDGMDGIKISQGLGLQDFDLIRVIGRGSYAKVLL
VRLKKNDQIYAMKVVKKELVHDDETTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHA
RFYAAEICIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTS
TFCGTPNYIAPEILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDY
LFQVILEKPIRIPRFLSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSI
DWDLLEKKQALPPFQPQITDDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEG
FEYINPLLLSTEESV
against human RefSeq to get our bearings.

GENE ID: 5590 PRKCZ | protein kinase C, zeta [Homo sapiens]
(Over 100 PubMed links)

Score = 1031 bits (2665), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 518/592 (87%), Positives = 518/592 (87%), Gaps = 73/592 (12%)

Query 1 MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW 60
MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW
Sbjct 1 MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW 60

Query 61 VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR 120
VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR
Sbjct 61 VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR 120

Query 121 RWRKLYCANGHLFQAKRFNR---------------------------------------- 140
RWRKLY ANGHLFQAKRFNR
Sbjct 121 RWRKLYRANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRCHGLVPLTC 180

Query 141 ----DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG 196
DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG
Sbjct 181 RKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG 240

Query 197 IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE-------- 248
IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE
Sbjct 241 IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDEDIDWVQTE 300

Query 249 ---------------------TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI 287
TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI
Sbjct 301 KHVFEQASSNPFLVGLHSCFQTTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI 360

Query 288 CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP 347
CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP
Sbjct 361 CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP 420

Query 348 EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF 407
EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF
Sbjct 421 EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF 480

Query 408 LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT 467
LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT
Sbjct 481 LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT 540

Query 468 DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV 519
DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV
Sbjct 541 DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV 592


Okay, so the human protein has been described previously: it is human protein kinase C zeta. Have the PB1 domain in PKCzeta and its implications been previously discussed? A quick PubMed search turned up two papers from earlier this decade (in Molecular Cell & JBC) which actually demonstrated the dimerization potential of the PKCzeta PB1 domain. So the PKC domain with a PB1 domain is not novel & noteworthy. What about the missing diacylglycerol-binding domain (that first big gap) in the chimp kinase? That could be interesting, so let's see whether we can find any EST evidence to support it. Alas, the only EST evidence refutes it and identifies the gap as spurious (and both of these ESTs were deposited in October 2007 and the paper submitted in March 2008, so they are not an unfair criticism)

>dbj|DC524857.1| DC524857 chimpanzee brain cDNA library PflB Pan troglodytes verus
cDNA clone PflB8010 5', mRNA sequence.
Length=404

Score = 108 bits (270), Expect(2) = 3e-31, Method: Composition-based stats.
Identities = 57/102 (55%), Positives = 58/102 (56%), Gaps = 44/102 (43%)
Frame = +2

Query 112 KSIYRRGARRWRKLYCANGHLFQAKRFNR------------------------------- 140
+SIYRRGARRWRKLYCANGHLFQAKRFNR
Sbjct 38 ESIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKR 217

Query 141 -------------DSVMPSQEPPVDDKNEDADLPSEETDGIA 169
DSVMPSQEPPVDDKNEDADLPSEETDGIA
Sbjct 218 CHGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIA 343


Score = 42.7 bits (99), Expect(2) = 3e-31, Method: Compositional matrix adjust.
Identities = 20/22 (90%), Positives = 21/22 (95%), Gaps = 0/22 (0%)
Frame = +3

Query 168 IAYISSSRKHDSIKDDSEDLKP 189
+ YISSSRKHDSIKDDSEDLKP
Sbjct 339 LLYISSSRKHDSIKDDSEDLKP 404


>dbj|DC519886.1| DC519886 chimpanzee brain cDNA library PccB Pan troglodytes verus
cDNA clone PccB0482 5', mRNA sequence.
Length=612

Score = 114 bits (284), Expect = 3e-26, Method: Compositional matrix adjust.
Identities = 63/107 (58%), Positives = 63/107 (58%), Gaps = 44/107 (41%)
Frame = +2

Query 113 SIYRRGARRWRKLYCANGHLFQAKRFNR-------------------------------- 140
SIYRRGARRWRKLYCANGHLFQAKRFNR
Sbjct 290 SIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRC 469

Query 141 ------------DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR 175
DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR
Sbjct 470 HGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR 610
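
Long gaps in a predicted protein relative to a well-curated ortholog, like the 44-column gap above, are exactly the spots worth checking against ESTs. Here is a small sketch for pulling them out of an aligned query row. The function name and the 10-column minimum are my own arbitrary choices, and the example row is stitched together from the Query 121-140 and Query 141 lines of the PRKCZ alignment.

```python
def gap_runs(aligned_row: str, min_len: int = 10) -> list:
    """Return (start_column, length) for each run of '-' at least min_len long."""
    runs = []
    start = None
    for i, ch in enumerate(aligned_row):
        if ch == "-":
            if start is None:
                start = i                      # a new run of gaps begins
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None                       # run ended at a residue
    # Handle a run that extends to the end of the row.
    if start is not None and len(aligned_row) - start >= min_len:
        runs.append((start, len(aligned_row) - start))
    return runs


# Chimp query row around the first big gap (40 gap columns + 4 on the next line).
query_row = "RWRKLYCANGHLFQAKRFNR" + "-" * 44 + "DSVMPSQEPPVDDKNEDADL"

print(gap_runs(query_row))
```

Each run it reports is a candidate missed exon in the gene model, to be confirmed or refuted with a TBLASTN search against ESTs as done above.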



I checked in detail one more note (about the chimp protein ENSPTRP00000001185 and its human ortholog) about a domain architecture claimed to be unique to human & chimp due to a missing domain. Again, the RefSeq protein search revealed that the chimp protein is nearly identical to a known human kinase (MARK2) albeit greatly truncated -- and the missing domain is beyond the truncation point.

I haven't checked every kinase in the paper, but seeing the same classes of mistakes repeatedly doesn't give much hope. Comparing the human & chimp kinomes (or any other well-defined subset of genes) is a worthwhile enterprise -- so long as it is kept in mind that the chimp genome is a very rough draft and all appropriate computational controls are used. This paper, unfortunately, shows no awareness of either of these principles.

What irks me most about this sort of paper is that it gives all of us a bit of a black eye. Someone who saw the abstract & got excited would be in for a big letdown. It's hard enough to earn the respect of bench biologists without it being tossed away with poorly done analyses.

So what are the positive lessons to be learned? Here are a few tips

  1. Always try to find meaningful biological names for your sequences. Use them in your figures & search them in the literature like a bloodhound.

  2. Always check genomic predictions against EST & cDNA databases.

  3. Always try to root your phylogenetic trees, unless you have a really good reason not to do so. And, if your tree is rooted, you must explain how you rooted it.

  4. If your results sound amazing, take a deep breath & think of several tests that could debunk them. Then do those tests. If the results survive, go to bed & think of another batch of tests.



P.S. One way to put reviewers in a bad mood is to not supply your sequences. The supplementary materials for this paper do not have all of the ORFs; I pulled some out from their alignments with a custom script. Elsewhere via Google I found a collection linked to the work -- but with the whole predicted chimp proteome in it! Very unwieldy & slow to download!
P.P.S. For anyone interested in exploring further, here are the other sequences from the alignment in additional file 3. The number in the header was added by my script to indicate which alignment within that file the sequence was taken from.

>2|Chimp|ENSPTRP00000019171
MSAEVRLRRLQQLVLDPGFLGLEPLLDLLLGVHQELGASELAQDKYVADFLQWAEPIVVRL
KEVRLQRDDFEILKVIGRGAFSEVAVVKMKQTGQVYAMKIMNKWDMLKRGEVSCFREERDV
LVNGDRRWITQLHFAFQDENYLYLVMEYYVGGDLLTLLSKFGERIPAEMARFYLAEIVMAI
DSVHRLGYVHRDIKPDNILLDRCGHIRLADFGSCLKLRADGTVRSLVAVGTPDYLSPEILQ
AVGGGPGTGSYGPECDWWALGVFAYEMFYGQTPFYADSTAETYGKIVHYKEHLSLPLVDEG
VPEEARDFIQRLLCPPETRLGRGGAGDFRTHPFFFGLDWDGLRDSVPPFTPDFEGATDTCN
FDLVEDGLTAMVSGGGETLSDIREGAPLGVHLPFVGYSYSCMALRDSEVPGPTPMELEAEQ
LLEPHVQAPSLEPSVSPQDETAEVAVPAAVPAAEAEAEVTLRELQEALEEEVLTRQSLSRE
MEAIRTDNQNFASQLREAEARNRDLEAHVRQLQERMELLQAEGATAVTGVPSPRATDPPSH
VPWPGLSXALSLLLFAVVLSRAAALGCLGLVAPAGXLXAVWRRPGAARAPX
>4|Chimp|ENSPTRP00000011569
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQ
LMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRS
GSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKK
EVIVAKDEVAHTLTENRVLQNSRHPFLTALKYSFQTHDRLCFVMEYANGGELFFHLSRERV
FSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENLMLDKDGHIKITDFGLCKEGIKDGAT
MKTFCGTSEYLAPRLSPPFKPQVTSETDTRYFDEEFTAQMITITPP
DQDDSMECVDSERRPHFPQFSYSASGTA
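
For the curious, the core of pulling ORFs out of an alignment is simple: strip the gap characters and write a FASTA record. The sketch below is not my original script, and it assumes the aligned row and its identifiers have already been parsed out of the alignment file (that parsing depends on the file's format and is not shown). The function names are hypothetical; the header layout mirrors the ones used in this post.

```python
def degap(aligned_row: str) -> str:
    """Remove alignment gap characters ('-' and '.') from an aligned sequence."""
    return aligned_row.replace("-", "").replace(".", "")


def to_fasta(alignment_index: int, species: str, protein_id: str,
             aligned_row: str, width: int = 60) -> str:
    """Build a FASTA record with an 'index|species|id' header, as in this post."""
    seq = degap(aligned_row)
    # Wrap the ungapped sequence at a fixed width.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{alignment_index}|{species}|{protein_id}\n" + "\n".join(lines)


# Toy aligned row; the ID is one of the chimp proteins discussed above.
print(to_fasta(3, "Chimp", "ENSPTRP00000001150", "SLAHIW--KARHTL", width=10))
```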

7 comments:

Jonathan Badger said...

You wrote: Always try to root your phylogenetic trees, unless you have a really good reason to do so. And, if your tree is rooted, you must explain how you rooted it

Are you trying to say:
1) Always try to root your phylogenetic trees, unless you have a really good reason not to do so.

or

2) Never root your phylogenetic trees, unless you have a really good reason to do so.

There are arguments for rooting and not rooting, so I'm not just being picky about grammar; I really don't know which side you are supporting.

Keith Robison said...

DUH!!! Thanks for catching that missing "NOT". I probably would, on thinking about it, write something different, such as "If you haven't explicitly rooted your tree, then don't treat it as a rooted tree".

Unknown said...

Good catch, Keith. Unfortunately everyone is now an expert and uses published information rather than revalidating what are in actuality tentative results.

I do a lot of reanalysis of genomics projects as a consultant. Guess it's true that there's always time (and money) to do something twice, but never time to do it once properly.

- Brian Moldover (and hi!)

Anonymous said...

This is hilarious. I remember slaughtering an earlier version of this manuscript when it was in review at a different journal. I guess you can get anything published if you just shop it around long enough...

Anonymous said...

Can I use your example to craft a bioinformatics laboratory for a class?

Keith Robison said...

I meant to plant the idea that this paper could be a useful starting point for a bioinformatics course exercise. My grad school professors loved to require us to read seriously flawed papers (in one case, a complete fraud) to see what we would pick up on (none of us figured out the fraudulent paper was such, a good exercise in instilling humility!)

Chris Cotsapas said...

@AnonymousReviewer - one of my great frustrations with peer review is spending a non-trivial amount of time constructively criticising a manuscript, only to see it resurface in all its flawed glory at some other journal. Gah!