It's that time of year again -- when those of us in the U.S. must deal with numbered forms and lettered schedules. In this light, I wish to share a recent piece of correspondence:
Dear Dr. Robison:
After great difficulty (must your handwriting be so atrocious?) I have reviewed the accounts at your business enterprise. I regret to inform you that two of your accounts, with ATP Corp and NAD(P)H Ltd, are grossly out of balance. While you are running a deficit with the former and a surplus with the latter, as we have discussed previously these separate accounts cannot be merged. Your enterprise is doomed to failure (and I think it goes without saying that some sort of Madoffian scheme will not be countenanced by me). You must bring these into balance or your enterprise will fail, never mind the horror of trying to explain this in an audit.
I realize I am not qualified to comment on the technical aspects of your effort. However, may I suggest you get out of the lab more and get some fresh air? Perhaps some oxygen would stimulate your activity in a most productive way?
Sincerely,
Colin Escherich, C.P.A.
A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery
Saturday, February 28, 2009
Sunday, February 08, 2009
Any Genome Sequence You Want, As Long As It's Human
It's been interesting reading dispatches coming from bloggers Dan Koboldt and Daniel MacArthur, who are attending the Marco Island conference, the big yearly confab on bleeding edge sequencing technology. How have I resisted this conference for so long, especially with the climate draw???
One company that is again receiving a lot of attention is Complete Genomics, which is proposing to build a set of sequencing centers to sequence human genomes at $5K a pop. What is striking is that their business model is to sequence only human genomes and nothing else, which particularly surprised Daniel MacArthur at Genetic Futures.
As a biologist and someone fascinated with all genomes, such a policy is not a welcome thought. But, as someone who has worked in an industrial high-throughput production facility, I think I can reverse engineer the logic pretty well (I have no connections to or inside information from the company).
Why would you want to do this? Simplicity. By focusing on only a single genome, all sorts of simplifications are created. Complexity costs significant money & time, and it is often what seems trivial that ends up being very costly. Just allowing a second genome in the door creates all sorts of additional work on the software side, and if that second source requires different sample prep that's an additional headache on the lab side.
Having only one genome kicking around also creates some interesting opportunities for quality control both for each sample and for the whole factory (which is what they are talking about building: a sequencing factory). One genome means only one reference sequence to compare against & one set of pathological problems for their assembly algorithm to be fortified against. One genome also means that if you see another genome in your data, you know something is wrong -- and if you see the same one genome repeatedly you may have a factory-wide problem.
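The two checks in that last sentence can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Complete Genomics has described: the function names, the 95% threshold and the notion of a per-sample "fingerprint" string (say, genotype calls at a fixed panel of variable sites) are all assumptions made up for the example.

```python
from typing import Dict, List


def flag_foreign_reads(mapped_reads: int, total_reads: int,
                       min_fraction: float = 0.95) -> bool:
    """Return True if too few reads match the single expected reference.

    min_fraction is an arbitrary illustrative threshold, not a published one.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads < min_fraction


def flag_repeated_genomes(fingerprints: Dict[str, str]) -> List[List[str]]:
    """Group sample IDs whose (hypothetical) genotype fingerprints are identical.

    Seeing the same genome under several sample IDs hints at a factory-wide
    mix-up or carry-over rather than a problem with one sample.
    """
    groups: Dict[str, List[str]] = {}
    for sample_id, fingerprint in fingerprints.items():
        groups.setdefault(fingerprint, []).append(sample_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

With only one reference in the building, both checks stay this simple; every additional species would need its own reference and its own expected mapping rate.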
"Any color you want so long as it is black" got Ford to the top of the U.S. automotive heap, but it didn't keep them there -- I believe that GM's offering colors helped push them into first. So will the market support Complete's vision? I think it can.
Complete is apparently talking about running a million genomes per year. At $5K each, that would be $5 billion, some serious cash flow. I don't know if they've estimated the market correctly, but it doesn't seem ridiculous: if a large fraction of the world's wealthy decide to sequence their genomes (and their children's too) and if sequencing tumors becomes semi-routine, a few million human genomes a year is plausible. Of course, Complete would have to fight with all the other players for a share.
That implies a question: what comparable markets are they giving up? I'd love to see broader "zoonomics", where we go through the living world sequencing everything, but that's all going to be grant funded. Smaller genomes may also be completely mismatched with this sort of technology -- without some sort of multiplexing (complexity!). Similarly, it's not easy to see some big commercial market for metagenomics -- it will remain fascinating & there's no end to the ecological niches to explore, but who in the private sector is going to pony up major money for it? Oncogenic mouse models will supply lots of tumors for sequencing, but again probably not a big private sector activity.
The one area I can almost envision is sequencing valuable livestock or agricultural lines to understand their complete makeup. If this were done not only for parentals but for offspring in breeding programs, then perhaps a big market would be generated. But, is it really worth sequencing to completion or will some cheaper technology for skimming the surface suffice? If there is a market, then a logical business direction for Complete might be to do a joint venture or spinout focusing on alternate genomes -- but either the prize would need to be big or the one-genome business model would need to be failing for that to be worth diverting attention.
Wednesday, February 04, 2009
Trading off an argument from Scrubs
I was watching Scrubs (My New Role) last night & there was an exchange that I think should be a discussion point for everyone involved in medicine, though it wasn't the point the script writers really hammered on.
The setup is that a nurse was trying to get a doctor to change the antibiotic for a patient. The nurse's argument was that azithromycin required once daily dosing and would free her up for doing other things, whereas the doctor's selection of clindamycin meant 4 times daily dosing. The doctor replied in a condescending way that she had gone to med school, the nurse hadn't, and therefore the script would stand as written.
Now, the theme of the episode was this sort of professional interaction -- where someone higher on the professional totem pole disrespects someone lower. An important issue, to be sure. But I think, especially in these days when we are more than ever concerned about the cost of healthcare & how to deliver effective healthcare economically, the specific argument deserves more attention.
Now, I'll confess I haven't gone to med school & I have no particular expertise in antibiotics, other than practical experience. For example, my wife is allergic to a huge number of them, TNG broke out with Augmentin, and doxycycline gives me a stomachache if I take it on an empty stomach. Penicillin is mostly excreted, not metabolized, & you'll notice this in the bathroom once it has cleared the infection from your nasal passages. But I can't reasonably discuss azithromycin vs clindamycin on actual facts, so I'll use them as proxies for some hypotheticals.
Suppose, for example, that there was absolutely no clinical difference between the two. They both had the same spectrum of treatable bacteria, the same risk of similar side effects, no contraindications in this patient and both had the same cost. Then clearly the nurse is right and the doctor wrong, as that once-a-day dosing frees a valuable resource (the nurse). In other words, under these conditions the drug choice for a patient is neutral for that patient but has important ramifications for other patients at the hospital.
But what about the less clear cases? For example, suppose all of the above conditions were met except equal cost; the once daily med is significantly more expensive (e.g. azithromycin before it went off patent). On the one hand, my argument still holds unless it is a huge cost difference -- several minutes of a nurse's time is worth quite a bit (like most hospitals, the one on Scrubs is portrayed as being cash strapped & short on nurses). However, that more convenient drug costs real money, whereas the nurse's saving is in opportunity cost: an accountant browsing the budget is likely to see the one but not the other even if both are real.
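To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder (minutes per dose, the cost of an hour of nursing time); the only thing taken from the episode is the 4-versus-1 daily dosing. The point is just that the opportunity cost can be put on the same scale as the drug price.

```python
def nursing_minutes_saved(doses_per_day_old: int, doses_per_day_new: int,
                          minutes_per_dose: float) -> float:
    """Minutes of nursing time freed per patient per day by the simpler regimen."""
    return (doses_per_day_old - doses_per_day_new) * minutes_per_dose


def break_even_drug_premium(minutes_saved_per_day: float,
                            nurse_cost_per_hour: float) -> float:
    """Largest extra daily drug cost that the freed nursing time still pays for."""
    return minutes_saved_per_day / 60.0 * nurse_cost_per_hour


# Hypothetical inputs: 5 minutes per administration, $60 per nursing hour.
saved = nursing_minutes_saved(doses_per_day_old=4, doses_per_day_new=1,
                              minutes_per_dose=5.0)
premium = break_even_drug_premium(saved, nurse_cost_per_hour=60.0)
print(f"{saved:.0f} minutes/day saved; break-even premium ${premium:.2f}/day")
```

Under those invented numbers the once-daily drug could cost up to $15 more per day before the switch stops paying for itself, but only the drug line would show up in the budget the accountant reads.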
Now let's muddy the water further. Suppose the two drugs are clinically not precisely comparable but similar -- imagine if clindamycin is slightly broader spectrum or has a slightly lower risk of side effects. Now it becomes a really sticky wicket -- what additional risk to this patient is acceptable in order to reduce the risks to other patients (due to getting better nursing care)?
That last one is the sort that really is troublesome. We never like explicitly to risk one person to help multiple others, but we are often less troubled when we do it implicitly. I won't claim to be an ethics expert, so I'll leave it at that. But I think these scenarios embody real situations which will be faced, such as the fact that sometimes an expensive drug is better than a cheaper one (not to say this is always or even often true, just that it isn't always false). Or more generally: health care reform will be complex, because health care is complex.
Bacteria can mobilize a fifth column
I recently had to deal with a bacterial upper respiratory infection. Something to ponder about such problems is that not only did the little nasty have to gain a foothold on my immune system, but it also had to elbow a lot of other bacteria out of the way. After all, my respiratory tract is open to the air and is far from sterile; there is a whole ecosystem of bugs which generally get along with me. For an infection to take hold, either one of the regular residents has to go bad or the newcomers must steal some space.
A recent abstract in PNAS (alas, not an open access paper) provides a fascinating window on how that elbowing takes place. Staphylococcus aureus (aka the home front) is a standard resident of the respiratory tract, though it can of course be nasty in its own right if it gets through the skin. Streptococcus pneumoniae (charming moniker! aka the invaders) must push it aside. It turns out that one weapon the invaders use is hydrogen peroxide (H2O2), a staple of many home medicine cabinets -- though not mine growing up; Dad still favors tincture of iodine (curiously, cuts & scrapes often went unreported!).
Okay, that seems straightforward. Well, except the question of why the invaders themselves don't suffer some blowback. But it actually gets more interesting, because it turns out the H2O2 dose is sub-lethal. Huh? The invaders come in with flame throwers but set them to warm & cozy?
But sub-lethal doesn't mean physiologically irrelevant. The dose is enough for the home front to worry, as H2O2 can cause all sorts of damage. Indeed, the dose is strong enough to set off the SOS system, a DNA damage response.
The SOS system has an interesting side angle. Many bacteria carry dormant viruses, better known as lysogenic phage, within their genome. These viral genomes are integrated within their hosts' DNA and generally keep quiet, getting a free replication ride every time their host divides. However, that free ride isn't much good if your host dies with you in it, so these phage listen to the SOS response -- and when they hear it they go into their lytic phase, pumping out lots of virus and generally killing their host on the way out.
So now we have a picture: spook the home front enough that a fifth column of phage rises within and destroys them. Nifty.
Except, we're back to the blowback problem -- unless the invaders are also free of lysogenic phage they're going to have the same problem. However, it turns out that H2O2 does not activate the SOS response in the invaders, because they apparently are resistant to H2O2's DNA-damaging effects.
Understanding that resistance is a next area for work. Potentially, disabling it would offer an interesting antibiotic angle -- an antibiotic that was specific for the invaders by letting them blow themselves up. That's a big stretch (and the economics of antibiotic development are horrendous -- hence very few companies try it or stay in it) so don't hold your breath (or cough) waiting for it -- but it is a fun aspect to ponder.
Monday, February 02, 2009
A Fatally Flawed Paper
I like to review manuscripts but don't do so very often. When I started this blog I thought I might often use it to play "If I had been the reviewer", but I haven't done that much. However, a paper came to my attention that I can't stop thinking about until I tackle it here.
As an aside, I find papers I review to fall into three categories. The first are very solid papers that I can find little to comment on; I might make a suggestion or two (often about data visualization), but if the core is solid there isn't much for the reviewer to do. The second category is the most frustrating: when I feel the paper is on the edges of my expertise & I start to question whether I should have agreed to review it (which is done after seeing an abstract). The third category is the one I can really dig into: seriously flawed papers. I think one of my reviews of a paper was approaching the length of the manuscript; the paper was badly flawed but there was a thread of substance that with a lot of work could be turned into something decent.
Anyway, I noticed this paper in the BioMedCentral Table of Contents extract which is emailed to me weekly.
Sometimes when there has been some accident, a review of the circumstances leading up to it will reveal many opportunities for recognizing that a bad situation had been set up: the engineer ignored a stop signal or the dispatcher should have noticed the switch was set incorrectly. This paper, particularly one of its centerpiece findings, has that feel to it: there were many warning flags that something was amiss, but unfortunately the authors and the reviewers failed to see them.
When I first planned this critique, I was going to detail several examples. However, that would seem to lead to a very long post, so I will pick a few examples and claim that they are representative. If anyone wishes to challenge that claim, then I'll flesh out some more. Also, I feel the first example is particularly apropos because it is a bit of a centerpiece; it gets a lot of space (including a special figure) in the text.
It was this bit of text that caused me to raise my eyebrows as far as they could go (I wish I could do the Spock single-eyebrow raise, but I can't). The bolding is mine to emphasize the big surprises.
The first huge surprise is to find a kinase with so little sequence identity to its closest human counterpart. The DNA identity of human and chimp is routinely cited in the high 90s percent (how exactly you calculate it affects the final value) and they are our closest relatives. Finding a human-mouse ortholog identity of less than 31% would be stunning; for human-chimp it would be indescribably surprising. The second huge surprise is the claim of a hybrid Polo-CK1 kinase. The Polo box is a domain which recognizes phosphorylated peptides and is important in the activation & substrate recognition by Polo kinases. It is the signature of the Polo subfamily and has not been reported to be found on any other protein. The third surprise is in the dendrogram; it is claimed that this kinase has an affinity to CK1-type kinases, but in their rooted dendrogram (source of rooting not explained, a serious error) this kinase is an outgroup to all of the other presented kinases! Without some true outgroups (ideally representatives of other key families), how can we tell what it is most similar to?
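As a sanity check on numbers like that 31%, percent identity is easy to compute from any pairwise alignment. The sketch below is one common convention, assumed here for illustration (count matches over columns where neither sequence has a gap); it is not the paper's classification protocol, and as noted above the choice of denominator changes the value you get.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence is gapped.

    Both inputs must be rows of the same pairwise alignment (equal length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip gap columns; other conventions count them as mismatches
        compared += 1
        if residue_a.upper() == residue_b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment, not real kinase sequence: 4 matches over 5 ungapped columns.
print(percent_identity("MKT-LV", "MKSALV"))
```

A true human-chimp ortholog pair run through something like this should land in the 90s, which is why a reported 31% is a red flag about the gene model rather than a biological discovery.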
Now, a strong criticism of mine of this paper is that it relies too much on Ensembl-derived sequences and annotation. Ensembl is a great system & I have high respect for it, but it is also trying to do the very complex job of integrating a lot of other data with genomic sequences of varying quality and we are not scientists if we fully trust it to always be correct. It is much better to have a more definitive reference point; why rely on someone's hand sketched map if you have a USGS topographic section available? And for a solid anchor database, it is hard to beat the RefSeq human protein dataset. So, we take the sequence from their figure for this ORF
and our top hit is
That resolves all these questions: it's a straightforward ortholog of PLK3 (which explains the Polo boxes), not some noteworthy hybrid and the sequence identity is 90+% -- and that score is dropped a lot by some iffy regions like this
What's going on there? Well, most likely this is underlining the draft nature of the chimpanzee genome. I checked with TBLASTN, and there aren't ESTs around this region -- the chimp PLK3 is pretty much a pure gene prediction model -- a tough problem that has been tackled well but never perfectly. Plus, the underlying genomic data is, well, draft quality. Another TBLASTN search revealed that although this Ensembl prediction is from the middle of a large contig, the N-terminus of human PLK3 has a great match on another contig -- but from the same chromosome.
Okay, maybe that's a fluke. So here's another chimp kinase highlighted in the text
Again, the first thing to do is to search the ORF
Okay, so the human protein has been described previously: it is human protein kinase C zeta. Has the PB1 domain in PKCzeta and its implications been previously discussed? A quick PubMed search turned up two papers from earlier this decade (in Molecular Cell & JBC) which actually demonstrated the dimerization potential of the PKCzeta PB1 domain. So the PKC domain with a PB1 domain is not novel & noteworthy. What about the missing diacylglycerol-binding domain (that first big gap) in the chimp kinase? That could be interesting, so let's see whether we can find any EST evidence to support it. Alas, the only EST evidence refutes it and identifies the gap as spurious (and both of these ESTs were deposited in October 2007 and the paper submitted in March 2008, so they are not an unfair criticism).
I checked in detail one more note (about the chimp protein ENSPTRP00000001185 and its human ortholog) about a domain architecture claimed to be unique to human & chimp due to a missing domain. Again, the RefSeq protein search revealed that the chimp protein is nearly identical to a known human kinase (MARK2) albeit greatly truncated -- and the missing domain is beyond the truncation point.
I haven't checked every kinase in the paper, but seeing the same classes of mistakes repeatedly doesn't give much hope. Comparing the human & chimp kinomes (or any other well-defined subset of genes) is a worthwhile enterprise -- so long as it is kept in mind that the chimp genome is a very rough draft and all appropriate computational controls are used. This paper, unfortunately, shows no awareness of either of these principles.
What irks me most about this sort of paper is that it gives all of us a bit of a black eye. Someone who saw the abstract & got excited would be in for a big letdown. It's hard enough to earn the respect of bench biologists without it being tossed away with poorly done analyses.
So what are the positive lessons to be learned? Here are a few tips
P.S. One way to put reviewers in a bad mood is to not supply your sequences. The supplementary materials for this paper do not have all of the ORFs; I pulled some out from their alignments with a custom script. Elsewhere via Google I found a collection linked to the work -- but with the whole predicted chimp proteome in it! Very unwieldy & slow to download!
P.P.S. For anyone interested in exploring further, here are the other sequences from the alignment in additional file 3. The number in the header was added by my script to indicate which alignment within that file the sequence was taken from.
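For anyone who wants to do the same kind of extraction, here is a minimal Python sketch of the idea. It is not the script used for this post: the input layout (a list of alignments, each a list of name and gapped-sequence pairs) and the `alignment=N` header tag are assumptions invented for the example, and parsing the actual additional file into that layout is left out.

```python
from typing import Iterable, List, Tuple

Alignment = List[Tuple[str, str]]  # (sequence name, gapped sequence) pairs


def alignments_to_fasta(alignments: Iterable[Alignment],
                        gap_chars: str = "-.",
                        width: int = 60) -> str:
    """Strip gap characters and emit FASTA, tagging each header with the
    1-based index of the alignment the sequence came from."""
    lines: List[str] = []
    for index, alignment in enumerate(alignments, start=1):
        for name, gapped in alignment:
            sequence = "".join(ch for ch in gapped if ch not in gap_chars)
            lines.append(f">{name} alignment={index}")
            # Wrap the ungapped sequence at a fixed line width.
            for start in range(0, len(sequence), width):
                lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy input with made-up names and residues.
example = [
    [("chimpA", "MK-TL"), ("humanA", "MKSTL")],
    [("chimpB", "A..C")],
]
print(alignments_to_fasta(example), end="")
```

The resulting FASTA can be pasted straight into a BLAST search against RefSeq, which is the check the paper needed.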
Comparative kinomics of human and chimpanzee reveals unique kinship and functional diversity generated by new domain combinations. Now, back at MLNM I had for a while specialized in protein kinases, so it is a field of some interest. I hadn't kept up with the status of the chimpanzee genome sequencing, but there is a longstanding familial interest in this species so that was another angle of interest.
For example, a chimpanzee kinase classified as casein kinase 1 (ENSPTRP00000001150) on the basis of significant sequence similarity (31%) of the catalytic domain and excellent e-value (2e-16) with the casein kinase 1 from human. However this chimp kinase has a POLO BOX tethered to the kinase catalytic domain.
Thus this chimp kinase represents a hybrid CK1_POLO kinase. Interestingly ENSEMBL reports that ENSPTRP00000001150 has a high similarity with the human kinase ENSP00000361275. However, according to our classification protocol ENSP00000361275 is classified as a POLO kinase on the basis of 52% sequence identity with classical POLO kinases and excellent e-value of e-112. Figure 1 shows the dendrogram of the CK1 sub-family of kinases and it highlights the significant divergence of chimp homologue from its counterparts in other organisms
The first huge surprise is to find a kinase with so little sequence identity to its closest human counterpart. The DNA identity of human and chimp is routinely cited in the high 90 percent (how exactly you calculate it affects the final value) and they are our closest relatives. Finding a human-mouse ortholog identity of less than 31% would be stunning; for human-chimp it would be indescribably surprising. The second huge surprise is the claim of a hybrid Polo-CK1 kinase. The Polo box is a domain which recognizes phosphorylated peptides and is important in the activation & substrate recognition by Polo kinases. It is the signature of the Polo subfamily and has not been reported to be found on any other protein. The third surprise is in the dendrogram; it is claimed that this kinase has an affinity to CK1-type kinases, but in their rooted dendrogram (source of rooting not explained, a serious error) this kinase is an outgroup to all of the other presented kinases! Without some true outgroups (ideally representatives of other key families), how can we tell what it is most similar to?
Now, a strong criticism of mine of this paper is that it relies too much on Ensembl-derived sequences and annotation. Ensembl is a great system & I have high respect for it, but it is also trying to do the very complex job of integrating a lot of other data with genomic sequences of varying quality and we are not scientists if we fully trust it to always be correct. It is much better to have a more definitive reference point; why rely on someone's hand sketched map if you have a USGS topographic section available? And for a solid anchor database, it is hard to beat the RefSeq human protein dataset. So, we take the sequence from their figure for this ORF
>3|Chimp|ENSPTRP00000001150
SLAHIWKARHTLLEPEVRYYLRQILSGLKYLHQRGILHRDLKLGNFFITENMELKVGDF
GLAARLEPPEQRKKTICGTPNYVAPEVLLRQGHGPEADVWSLGCVMYTLLCGSPPFETA
DLKETYRCIKQVHYTLPASLSLPARQLLAAILRASPRDRPSIDQILRHDFFTKGYTPDR
LPISSCVTVPDLTPPNPARSLFAKVTKSLFGRKKKKSKNHAQESDEVSGLVSGLMRTSV
GHQDARPEAPAASGPAPVSLVETAPEDSSPRGTLASSGDGFEEGLTVATVVESALCALR
NCVAFMPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQAL
LMLFSDGTVQVNFYGDHTKLILSGWEPLLVTFVARNRSACTYLASHLRQLGCSPDLRQRLRYALRLLRDRSPA
and our top hit is
GENE ID: 1263 PLK3 | polo-like kinase 3 (Drosophila) [Homo sapiens]
(Over 10 PubMed links)
Score = 663 bits (1710), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 328/362 (90%), Positives = 335/362 (92%), Gaps = 14/362 (3%)
That resolves all these questions: it's a straightforward ortholog of PLK3 (which explains the Polo boxes), not some noteworthy hybrid and the sequence identity is 90+% -- and that score is dropped a lot by some iffy regions like this
Query 301 MPPAEQNPAPLAQPEPLVWVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSD 360
MPPAEQNPAPLAQPEPLVWVSKWVDY + + + + +LF+D
Sbjct 445 MPPAEQNPAPLAQPEPLVWVSKWVDYSNKFG-------------FGYQLSSRRVAVLFND 491
Query 361 GT 362
GT
Sbjct 492 GT 493
Score = 176 bits (446), Expect = 1e-43, Method: Compositional matrix adjust.
Identities = 83/91 (91%), Positives = 87/91 (95%), Gaps = 0/91 (0%)
Query 319 WVSKWVDYGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG 378
++ + + GGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG
Sbjct 538 YMEQHLMKGGDLPSVEEVEVPAPPLLLQWVKTDQALLMLFSDGTVQVNFYGDHTKLILSG 597
Query 379 WEPLLVTFVARNRSACTYLASHLRQLGCSPD 409
WEPLLVTFVARNRSACTYLASHLRQLGCSPD
Sbjct 598 WEPLLVTFVARNRSACTYLASHLRQLGCSPD 628
What's going on there? Well, most likely this is underlining the draft nature of the chimpanzee genome. I checked with TBLASTN, and there aren't ESTs around this region -- the chimp PLK3 is pretty much a pure gene prediction model -- a tough problem that has been tackled well but never perfectly. Plus, the underlying genomic data is, well, draft quality. Another TBLASTN search revealed that although this Ensembl prediction is from the middle of a large contig, the N-terminus of human PLK3 has a great match on another contig -- but from the same chromosome.
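To put a rough number on how much those iffy regions drag the score down, here is a minimal sketch that reads a BLAST summary line like the one above and reports identity both over the whole alignment and over the gap-free columns only. The function name `identity_stats` and the "identities divided by length minus gaps" definition are my own choices for illustration, not anything BLAST itself reports.

```python
import re

# Matches the "Identities = a/b ..., Gaps = c/d" summary line printed by BLAST.
SUMMARY = re.compile(r"Identities = (\d+)/(\d+).*?Gaps = (\d+)/(\d+)")


def identity_stats(summary_line):
    """Return (raw %, gap-free %) identity from a BLAST summary line.

    The gap-free figure divides identities by (alignment length - gap
    positions); that definition is an assumption made for this sketch.
    """
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a BLAST identities line")
    identical, length, gaps, _ = map(int, match.groups())
    raw = 100.0 * identical / length
    gap_free = 100.0 * identical / (length - gaps)
    return raw, gap_free


line = "Identities = 328/362 (90%), Positives = 335/362 (92%), Gaps = 14/362 (3%)"
raw, gap_free = identity_stats(line)
print(f"{raw:.1f}% overall, {gap_free:.1f}% over gap-free columns")
```

For the first PLK3 segment this moves the figure from about 90% to about 94%, and the remaining mismatches sit mostly in the scrambled stretch shown above.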
Okay, maybe that's a fluke. So here's another chimp kinase highlighted in the text
A protein (ENSPTRP00000000076), classified under PKC subfamily, is composed of a PB1 domain followed by the protein kinase domain which is followed by a protein kinase C terminal domain (Figure 3a1). The PB1 domain is present in many eukaryotic cytoplasmic signalling proteins and is responsible, although not systematically, in the formation of PB1 dimers [25]. It thus serves as a molecular recognition module. This architecture is known so far only in an atypical PKC of Phallusia mammilata, a sea squirt. Our analysis identified two chimpanzee PKCs and a human PKC with a similar architecture, in which a phorbol esters/diacylglycerol binding domain is inserted between the PB1 and the protein kinase domain. The presence of the phorbol esters/diacylglycerol binding domain in combination with the protein kinase and a PKC terminal domain indicates that it is probably responsible for the recruitment of diacylglycerol, which in turns might be involved in activation of the kinase. The deletion of this domain in chimpanzee PKC (ENSPTRP00000000076) implies that the recruitment of diacylglycerol might be achieved by an external interacting module.
Again, the first thing to do is to search the ORF against human RefSeq to get our bearings.
>1|Chimp|ENSPTRP00000000076
MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPL
TLKWVDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDK
SIYRRGARRWRKLYCANGHLFQAKRFNRDSVMPSQEPPVDDKNEDADLPSEETDGI
AYISSSRKHDSIKDDSEDLKPVIDGMDGIKISQGLGLQDFDLIRVIGRGSYAKVLL
VRLKKNDQIYAMKVVKKELVHDDETTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHA
RFYAAEICIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTS
TFCGTPNYIAPEILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDY
LFQVILEKPIRIPRFLSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSI
DWDLLEKKQALPPFQPQITDDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEG
FEYINPLLLSTEESV
GENE ID: 5590 PRKCZ | protein kinase C, zeta [Homo sapiens]
(Over 100 PubMed links)
Score = 1031 bits (2665), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 518/592 (87%), Positives = 518/592 (87%), Gaps = 73/592 (12%)
Query 1 MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW 60
MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW
Sbjct 1 MPSRTGPKMEGSGGRVRLKAHYGGDIFITSVDAATTFEELCEEVRDMCRLHQQHPLTLKW 60
Query 61 VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR 120
VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR
Sbjct 61 VDSEGDPCTVSSQMELEEAFRLARQCRDEGLIIHVFPSTPEQPGLPCPGEDKSIYRRGAR 120
Query 121 RWRKLYCANGHLFQAKRFNR---------------------------------------- 140
RWRKLY ANGHLFQAKRFNR
Sbjct 121 RWRKLYRANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRCHGLVPLTC 180
Query 141 ----DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG 196
DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG
Sbjct 181 RKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSRKHDSIKDDSEDLKPVIDGMDG 240
Query 197 IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE-------- 248
IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDE
Sbjct 241 IKISQGLGLQDFDLIRVIGRGSYAKVLLVRLKKNDQIYAMKVVKKELVHDDEDIDWVQTE 300
Query 249 ---------------------TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI 287
TTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI
Sbjct 301 KHVFEQASSNPFLVGLHSCFQTTSRLFLVIEYVNGGDLMFHMQRQRKLPEEHARFYAAEI 360
Query 288 CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP 347
CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP
Sbjct 361 CIALNFLHERGIIYRDLKLDNVLLDADGHIKLTDYGMCKEGLGPGDTTSTFCGTPNYIAP 420
Query 348 EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF 407
EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF
Sbjct 421 EILRGEEYGFSVDWWALGVLMFEMMAGRSPFDIITDNPDMNTEDYLFQVILEKPIRIPRF 480
Query 408 LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT 467
LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT
Sbjct 481 LSVKASHVLKGFLNKDPKERLGCRPQTGFSDIKSHAFFRSIDWDLLEKKQALPPFQPQIT 540
Query 468 DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV 519
DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV
Sbjct 541 DDYGLDNFDTQFTSEPVQLTPDDEDAIKRIDQSEFEGFEYINPLLLSTEESV 592
Okay, so the human protein has been described previously: it is human protein kinase C zeta. Have the PB1 domain in PKCzeta and its implications been previously discussed? A quick PubMed search turned up two papers from earlier this decade (in Molecular Cell & JBC) which actually demonstrated the dimerization potential of the PKCzeta PB1 domain. So the PKC domain with a PB1 domain is not novel & noteworthy. What about the missing diacylglycerol-binding domain (that first big gap) in the chimp kinase? That could be interesting, so let's see whether we can find any EST evidence to support it. Alas, the only EST evidence refutes it and identifies the gap as spurious (and both of these ESTs were deposited in October 2007 and the paper submitted in March 2008, so they are not an unfair criticism)
>dbj|DC524857.1| DC524857 chimpanzee brain cDNA library PflB Pan troglodytes verus
cDNA clone PflB8010 5', mRNA sequence.
Length=404
Score = 108 bits (270), Expect(2) = 3e-31, Method: Composition-based stats.
Identities = 57/102 (55%), Positives = 58/102 (56%), Gaps = 44/102 (43%)
Frame = +2
Query 112 KSIYRRGARRWRKLYCANGHLFQAKRFNR------------------------------- 140
+SIYRRGARRWRKLYCANGHLFQAKRFNR
Sbjct 38 ESIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKR 217
Query 141 -------------DSVMPSQEPPVDDKNEDADLPSEETDGIA 169
DSVMPSQEPPVDDKNEDADLPSEETDGIA
Sbjct 218 CHGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIA 343
Score = 42.7 bits (99), Expect(2) = 3e-31, Method: Compositional matrix adjust.
Identities = 20/22 (90%), Positives = 21/22 (95%), Gaps = 0/22 (0%)
Frame = +3
Query 168 IAYISSSRKHDSIKDDSEDLKP 189
+ YISSSRKHDSIKDDSEDLKP
Sbjct 339 LLYISSSRKHDSIKDDSEDLKP 404
>dbj|DC519886.1| DC519886 chimpanzee brain cDNA library PccB Pan troglodytes verus
cDNA clone PccB0482 5', mRNA sequence.
Length=612
Score = 114 bits (284), Expect = 3e-26, Method: Compositional matrix adjust.
Identities = 63/107 (58%), Positives = 63/107 (58%), Gaps = 44/107 (41%)
Frame = +2
Query 113 SIYRRGARRWRKLYCANGHLFQAKRFNR-------------------------------- 140
SIYRRGARRWRKLYCANGHLFQAKRFNR
Sbjct 290 SIYRRGARRWRKLYCANGHLFQAKRFNRRAYCGQCSERIWGLARQGYRCINCKLLVHKRC 469
Query 141 ------------DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR 175
DSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR
Sbjct 470 HGLVPLTCRKHMDSVMPSQEPPVDDKNEDADLPSEETDGIAYISSSR 610
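Big runs of gaps in the predicted protein, like the ones in these alignments, are easy to flag automatically before anyone builds a story on a "deleted" domain. A minimal sketch, assuming you already have the gapped query row of a pairwise alignment as a single string; the name `gap_runs` and the 20-residue cutoff are arbitrary choices of mine:

```python
import re


def gap_runs(aligned_query, min_len=20):
    """Return (start_column, length) for each run of '-' at least min_len long.

    A long run in the predicted protein, where the reference has sequence,
    marks a region to check against EST/cDNA evidence.
    """
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(r"-+", aligned_query)
        if len(m.group()) >= min_len
    ]


# Query row around the first big gap in the PRKCZ alignment, joined into one string.
query_row = "RWRKLYCANGHLFQAKRFNR" + "-" * 44 + "DSVMPSQEPPVDDKNEDADLPSEETDGIA"
for start, length in gap_runs(query_row):
    print(f"gap of {length} residues starting at alignment column {start}")
```

Each flagged run is only a candidate; whether it is a real deletion or a gene-model artifact still has to be settled against transcript evidence, as with the two ESTs above.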
I checked in detail one more note (about the chimp protein ENSPTRP00000001185 and its human ortholog) about a domain architecture claimed to be unique to human & chimp due to a missing domain. Again, the RefSeq protein search revealed that the chimp protein is nearly identical to a known human kinase (MARK2) albeit greatly truncated -- and the missing domain is beyond the truncation point.
I haven't checked every kinase in the paper, but seeing the same classes of mistakes repeatedly doesn't give much hope. Comparing the human & chimp kinomes (or any other well-defined subset of genes) is a worthwhile enterprise -- so long as it is kept in mind that the chimp genome is a very rough draft and all appropriate computational controls are used. This paper, unfortunately, shows no awareness of either of these principles.
What irks me most about this sort of paper is that it gives all of us a bit of a black eye. Someone who saw the abstract & got excited would be in for a big letdown. It's hard enough to earn the respect of bench biologists without it being tossed away with poorly done analyses.
So what are the positive lessons to be learned? Here are a few tips
- Always try to find meaningful biological names for your sequences. Use them in your figures & search them in the literature like a bloodhound.
- Always check genomic predictions against EST & cDNA databases.
- Always try to root your phylogenetic trees, unless you have a really good reason not to do so. And, if your tree is rooted, you must explain how you rooted it.
- If your results sound amazing, take a deep breath & think of several tests that could debunk them. Then do those tests. If they survive, go to bed & think of another batch of tests.
P.S. One way to put reviewers in a bad mood is to not supply your sequences. The supplementary materials for this paper do not have all of the ORFs; I pulled some out from their alignments with a custom script. Elsewhere via Google I found a collection linked to the work -- but with the whole predicted chimp proteome in it! Very unwieldy & slow to download!
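For what it's worth, pulling ORFs back out of an alignment is not much work. The following is a sketch of the general idea rather than my actual script: it assumes each alignment has already been parsed into a mapping of sequence name to gapped sequence (the parsing depends on the file format and is left out), strips the gap characters, and prefixes each FASTA header with the alignment's index. The function name `degap_to_fasta` and the input layout are hypothetical.

```python
def degap_to_fasta(alignments, width=60):
    """Turn parsed alignments into FASTA text with gap characters removed.

    `alignments` is a list of dicts mapping a sequence name to its gapped
    sequence (an assumed layout for this sketch). Each header is prefixed
    with the 1-based index of the alignment the sequence came from.
    """
    records = []
    for index, alignment in enumerate(alignments, start=1):
        for name, aligned in alignment.items():
            # Drop the two common gap symbols, '-' and '.'.
            sequence = aligned.replace("-", "").replace(".", "")
            lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
            records.append(f">{index}|{name}\n" + "\n".join(lines))
    return "\n".join(records)


# Toy input with made-up names and sequences, for illustration only.
example = [{"Chimp|EXAMPLE1": "MP--SRTG"}, {"Chimp|EXAMPLE2": "AC-DE.F"}]
print(degap_to_fasta(example))
```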
P.P.S. For anyone interested in exploring further, here are the other sequences from the alignment in additional file 3. The number in the header was added by my script to indicate which alignment within that file the sequence was taken from.
>2|Chimp|ENSPTRP00000019171
MSAEVRLRRLQQLVLDPGFLGLEPLLDLLLGVHQELGASELAQDKYVADFLQWAEPIVVRL
KEVRLQRDDFEILKVIGRGAFSEVAVVKMKQTGQVYAMKIMNKWDMLKRGEVSCFREERDV
LVNGDRRWITQLHFAFQDENYLYLVMEYYVGGDLLTLLSKFGERIPAEMARFYLAEIVMAI
DSVHRLGYVHRDIKPDNILLDRCGHIRLADFGSCLKLRADGTVRSLVAVGTPDYLSPEILQ
AVGGGPGTGSYGPECDWWALGVFAYEMFYGQTPFYADSTAETYGKIVHYKEHLSLPLVDEG
VPEEARDFIQRLLCPPETRLGRGGAGDFRTHPFFFGLDWDGLRDSVPPFTPDFEGATDTCN
FDLVEDGLTAMVSGGGETLSDIREGAPLGVHLPFVGYSYSCMALRDSEVPGPTPMELEAEQ
LLEPHVQAPSLEPSVSPQDETAEVAVPAAVPAAEAEAEVTLRELQEALEEEVLTRQSLSRE
MEAIRTDNQNFASQLREAEARNRDLEAHVRQLQERMELLQAEGATAVTGVPSPRATDPPSH
VPWPGLSXALSLLLFAVVLSRAAALGCLGLVAPAGXLXAVWRRPGAARAPX
>4|Chimp|ENSPTRP00000011569
MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQ
LMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRS
GSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKK
EVIVAKDEVAHTLTENRVLQNSRHPFLTALKYSFQTHDRLCFVMEYANGGELFFHLSRERV
FSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENLMLDKDGHIKITDFGLCKEGIKDGAT
MKTFCGTSEYLAPRLSPPFKPQVTSETDTRYFDEEFTAQMITITPP
DQDDSMECVDSERRPHFPQFSYSASGTA