Monday, September 14, 2009
Codon Optimization is Not Bunk?
In a previous post I asked "Is Codon Optimization Bunk?", reflecting on a paper which showed that the typical rules for codon optimization appeared not to be highly predictive of the expression of GFP constructs. A paper released in PLoS One sheds new light on this question.
A quick review. To a first approximation, the genetic code consists of 3-nucleotide units called codons; there are 64 possible codons. Twenty amino acids plus stop are specified by these codons (again, first approximation). So, either a lot of codons are never used or at least some codons mean the same thing. In the coding system used by the vast majority of organisms, two amino acids (Met and Trp) are encoded by a single codon apiece, whereas all the others have 2, 3, 4 or 6 codons (and stop gets 3). For amino acids with 2, 3 or 4 codons, it is the third position that makes the difference; the three that have 6 each have one block of 4 which follows this pattern and one set of two which also differ from each other in the third position. For two of the amino acids with 6 codons (Leu and Arg), the two groups are next to each other, so that you can think of the change between the blocks as a change in the first position; Ser is very strange in that its two blocks of codons are not at all like each other. For amino acids with two codons, the third position is either a purine (A or G) in both codons or a pyrimidine (C or T) in both. For a given amino acid, these codons are not used equally by a given organism; the pattern of bias in codon usage is quite distinct for an organism and its close cousins, and this has wide effects on the genome (and vice versa). For example, for some Streptomyces I have the codon bias pattern pretty much memorized: use G or C in the third position and you'll almost always pick a frequent codon; use A or T and you've picked a rarity. Some other organisms skew much the reverse; they like A or T in the third position.
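For anyone who wants to check the bookkeeping above, here is a minimal Python sketch that builds the standard genetic code and counts how many codons each amino acid gets. The variable names are my own, and only the standard table is covered (none of the variant codes that make all of this a "first approximation").

```python
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

codons = [a + b + c for a in BASES for b in BASES for c in BASES]
codon_table = dict(zip(codons, AMINO_ACIDS))

# Number of synonymous codons per amino acid (and for stop).
degeneracy = Counter(codon_table.values())

# Amino acids grouped by how many codons they have.
by_count = {}
for aa, n in sorted(degeneracy.items()):
    by_count.setdefault(n, []).append(aa)

for n in sorted(by_count):
    print(n, "codon(s):", " ".join(by_count[n]))
```

The grouping reproduces the pattern described above: M and W stand alone, L, R and S are the six-codon oddballs, and stop shares the three-codon group with Ile.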
Within a species, the genes can be divided further still. In E. coli, for example, there are roughly three classes of genes, each with a distinctive codon usage signature. One class is rich in important proteins which the cell probably needs a lot of, the second class seems to have many proteins which see only a soupçon of expression, and the third class is rich in proteins likely to have been recently acquired from other species.
So, it was natural to infer that this mattered for protein expression, particularly if you try to express a protein from one species in another. Some species seemed to care more than others. E. coli has a reputation for being finicky and is one of the best-studied systems. Not only did changing the codon usage over to a more E. coli-like pattern seem to help some proteins, but truly rare codons (used less than 5% of the time, though that is an arbitrary threshold) could cause all sorts of trouble.
However, the question remained how to optimize. Given all those interchangeable codons, a synthetic gene could have many permutations. Several major camps emerged with many variants, particularly amongst the gene synthesis companies. One school of thought said "maximize, maximize, maximize" -- pick the most frequently used codons in the target species. A second school said "context matters" -- and set out to maximize codon pair usage. A third school said "match the source!", meaning make the codon usage of the new coding sequence in the new species resemble the codon usage of the old coding region in the old species. This hedged against possible requirements for rare codons to ensure proper folding. Yet another school (which I belonged to) urged "balance", and chose to make the new coding region resemble a "typical" target species gene by sampling the codons based on their frequencies, throwing out the truly rare ones. The logic here is that hammering the same codon -- and thereby the same tRNA -- over and over would make that codon as good as rare.
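To make two of these camps concrete, here is a rough Python sketch of the "maximize" and "balance" strategies. Everything in it is an assumption for illustration: the function names are mine, the usage table contains made-up placeholder fractions (not measured E. coli values), and the 5% cutoff is applied as a per-amino-acid fraction, which is only one way to read "truly rare". It is not any vendor's actual algorithm.

```python
import random

# Hypothetical usage table: amino acid -> {codon: fraction of that amino acid's codons}.
# The numbers are placeholders chosen for illustration only.
TOY_USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.7, "AAG": 0.3},
    "L": {"CTG": 0.5, "TTA": 0.15, "TTG": 0.15,
          "CTT": 0.1, "CTC": 0.08, "CTA": 0.02},
}

def maximize(protein, usage):
    """'Maximize' school: always take the most frequent codon."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)

def balanced(protein, usage, rare_cutoff=0.05, rng=None):
    """'Balance' school: sample codons in proportion to their usage,
    after discarding codons below rare_cutoff."""
    rng = rng or random.Random()
    out = []
    for aa in protein:
        kept = {c: f for c, f in usage[aa].items() if f >= rare_cutoff}
        kept = kept or usage[aa]  # fall back if every codon counts as rare
        choices = list(kept)
        weights = [kept[c] for c in choices]
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(out)

print(maximize("MKLLK", TOY_USAGE))
print(balanced("MKLLK", TOY_USAGE, rng=random.Random(0)))
```

With this toy table, `maximize` hammers CTG for every leucine, while `balanced` spreads leucines across five codons and never emits CTA, which falls below the cutoff.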
The new work has crumbs for several of these camps, but not many crumbs; it suggests much was wrong with each -- or perhaps, the same thing was wrong with each. The problem is that even with these schemes some proteins just didn't express well, leaving everyone scratching their heads. The GFP work seemed to suggest that the effects of codon usage were unpredictable if present, and in any case other factors, such as secondary structure near the ribosome binding site, were what counted.
What the new work did was synthesize a modest number (40) of versions of two very different proteins (a single-chain antibody and an enzyme), each version specifying the same protein sequence but with a different set of codons. Within each type of protein, the expression varied over two logs; clearly something matters. Furthermore, they divided some of the best and worst expressors into thirds and made chimaeras, head of good and tail of bad (and vice versa). Some chimaeras seemed to have expression resembling their parent for the head end, but others seemed to inherit from the tail end parent. So the GFP-based "ribosome binding site neighborhood secondary structure matters" hypothesis did not fare well in these tests.
After some computational slicing-and-dicing, what they did come up with is that codon usage matters. The twist is that it isn't matching the most-used codons (CAI) that's important, as shown in the figure at the top which I'm fair-using. The codons that matter aren't necessarily the most-used codons, but when cross-referenced with some data on which codons are most sensitive to starvation conditions the jackpot lights come on. When you use these as your guide, as shown below, the predictive ability is quite striking. In retrospect, this makes total sense: expressing a single protein at very high levels is probably going to deplete a number of amino acids. Indeed, this was the logic of the sampling approach. But I don't believe any proponent of that approach ever predicted this.
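For reference, CAI (the codon adaptation index) is conventionally computed as the geometric mean of each codon's "relative adaptiveness", meaning its frequency divided by the frequency of the most-used synonymous codon. The Python sketch below shows that arithmetic only. The function names are mine, the usage table is a made-up placeholder, and this is the classic frequency-based score, not the starvation-informed model from the paper. Unlike the usual definition, it also does not exclude single-codon amino acids.

```python
import math

# Hypothetical usage table: amino acid -> {codon: fraction}. Placeholder values only.
TOY_USAGE = {
    "K": {"AAA": 0.75, "AAG": 0.25},
    "E": {"GAA": 0.5, "GAG": 0.5},
}

def relative_adaptiveness(usage):
    """w(codon) = frequency / frequency of the best synonymous codon."""
    w = {}
    for codon_freqs in usage.values():
        best = max(codon_freqs.values())
        for codon, freq in codon_freqs.items():
            w[codon] = freq / best
    return w

def cai(dna, w):
    """Geometric mean of w over the codons of an in-frame coding sequence.
    Assumes every codon is in w with a nonzero weight."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return math.exp(sum(math.log(w[c]) for c in codons) / len(codons))

weights = relative_adaptiveness(TOY_USAGE)
print(round(cai("AAAGAA", weights), 3))  # all top codons
print(round(cai("AAGGAA", weights), 3))  # one less-used lysine codon
```

A sequence built entirely from top codons scores 1.0 by construction, which is exactly the quantity the "maximize" school pushes up.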
Furthermore, not only does this work on the training set but new coding regions were prepared to test the model, and these new versions had expression levels consistent with the new model.
What of secondary structure near the ribosome binding site? In some of the single-chain antibody constructs an effect could be seen, but it appears the codon usage effect is dominant. In conversations with the authors (more on this below), they mentioned that GFP is easy to code with secondary structure near the ribosome binding site; this is just an interesting interaction of the genetic code with the amino acid sequence of GFP. Since it is easy in this case to stumble on secondary structure, that effect shows up in that dataset.
This is all very interesting, but it is also practical. On the pure biology side, it suggests that studying starvation is applicable to studying high-level protein expression, which should enable further studies on this important problem. On the protein expression side, it suggests a new approach to optimizing expression of synthetic constructs. A catch, however: this work was run by DNA2.0, and they have filed for patents, at least some of which have issued (e.g. US 7561972 and US 7561973). I mention this only to note that it is so and to give some starting points for reading further; clearly I have neither the expertise nor the responsibility to interpret the legal meaning of patents.
Which brings us to one final note: this paper represents my first embargo! A representative of DNA2.0 contacted me back when my "bunk" post was written to mention that this work was going to emerge, and finally last week the curtain was lifted. Obviously they know how to keep a geek in suspense! They sent me the manuscript and engaged in a teleconference with the only proviso being that I continued to keep silent until the paper issued. I'm not sure I would have caught this paper otherwise, so I'm glad they alerted me; though clearly both the paper and this post are not bad press for DNA2.0. Good luck to them! Now that I'm on the other side of the fence, I'll buy my synthetic genes from anyone with a good price and a good design rationale.
Mark Welch, Sridhar Govindarajan, Jon E. Ness, Alan Villalobos, Austin Gurney, Jeremy Minshull, Claes Gustafsson (2009). Design Parameters to Control Synthetic Gene Expression in Escherichia coli. PLoS One, 4 (9). DOI: 10.1371/journal.pone.0007002