Tuesday, April 21, 2009

Is Codon Optimization Bunk?

There is a very interesting paper in Science from a week ago which hearkens back to my gene synthesis days at Codon. But first, some background.

The genetic code (at first approximation) uses 64 codons to encode 21 different signals (the 20 amino acids plus stop); hence there are some choices as to which codon to use. Amino acids and stop can have 1, 2, 3, 4 or 6 codons in the standard scheme of things. But those codons are rarely used with equal frequency. Leucine, for example, has 6 codons, some of which are used rarely and others often. Which codons are preferred and disfavored, and the degree to which this is true, depends on the organism. In the extreme, a codon can actually go so out of favor it goes extinct & can no longer be used, and sometimes it is later reassigned to something else; hence some of the more tidy codes in certain organisms.
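To make the counting concrete, here is a minimal Python sketch that builds the standard code and tallies how many codons each signal gets. The names (`CODON_TABLE`, `degeneracy`) are my own for illustration; the 64-letter string is the standard code written in the usual TCAG order, with `*` standing for stop.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
# '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to the signal it encodes.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons per signal (20 amino acids plus stop).
degeneracy = Counter(CODON_TABLE.values())

# How many signals fall into each degeneracy class (1, 2, 3, 4 or 6 codons).
class_sizes = Counter(degeneracy.values())

print(len(CODON_TABLE), "codons,", len(degeneracy), "signals")
for n_codons in sorted(class_sizes):
    members = sorted(aa for aa, n in degeneracy.items() if n == n_codons)
    print(n_codons, "codon(s):", "".join(members))
```

Leucine, serine and arginine are the three six-codon cases, which is why they offer the most room for choice.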

A further observation is that the more favored codons correspond to more abundant tRNAs and less favored ones to less abundant tRNAs. Furthermore, highly expressed genes are often rich in favored codons, and lowly expressed ones are much more likely to use rare ones. To complete the picture, in organisms such as E. coli there are genes which don't seem to follow the usual pattern -- and these are often associated with mobile elements and phage or have other suggestions that they may be recent acquisitions from another species.

A practical application of this is to codon optimize genes. If you are having a gene built to express a protein in a foreign host, then it would seem apropos to adjust the codon usage to the local dialect, which usually still leaves plenty of room to accommodate other wishes (such as avoiding the recognition sites for specific restriction enzymes). There are at least four major schemes for doing this, with different gene synthesis vendors preferring one or the other:

  • CAI Maximization. CAI is a measure of usage of preferred codons; this strategy tries to maximize the statistic by using the most preferred codons. Logic: if these are the most preferred codons, and highly expressed genes are rich in them, why not do the same?

  • Codon sampling. This strategy (which is what Codon Devices offered) samples from a set of codons with probabilities proportional to their usage in the organism, after first zeroing out the very rare codons and renormalizing the table. Logic: avoid the rare ones, but don't hammer the better ones either; balance is always good.

  • Dicodon optimization. In addition to codons showing preferences, there's also a pattern by which adjacent codons pair slightly non-randomly. One particular example: very rare codons are very unlikely to be followed by another very rare codon. Logic: an even better approach to "when in Rome..." than either of the two above.

  • Codon frequency matching. Roughly, this means look at the native mRNA and its uses of codons and ape this in the target species; a codon which is rare in the native should be replaced with one rare in the target. Logic: some rare codons may just help fold things properly
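As a rough illustration of the codon sampling idea, here is a minimal Python sketch. It is not Codon Devices' actual implementation; the function names, the `min_fraction` cutoff and the toy usage numbers are all made up for the example, and the numbers are not a real E. coli table.

```python
import random

# Hypothetical usage counts (per-thousand style); illustrative only.
TOY_USAGE = {
    "ATG": 27,                                   # Met
    "CTG": 50, "TTA": 13, "TTG": 13,             # Leu
    "CTT": 11, "CTC": 10, "CTA": 3,
    "AAA": 75, "AAG": 25,                        # Lys
}
TOY_SYNONYMS = {
    "M": ["ATG"],
    "L": ["CTG", "TTA", "TTG", "CTT", "CTC", "CTA"],
    "K": ["AAA", "AAG"],
}

def sampling_table(usage, synonyms, min_fraction=0.05):
    """Per amino acid: drop codons below min_fraction, then renormalize."""
    table = {}
    for aa, codons in synonyms.items():
        total = sum(usage[c] for c in codons)
        fractions = {c: usage[c] / total for c in codons}
        kept = {c: f for c, f in fractions.items() if f >= min_fraction}
        if not kept:
            # Cutoff removed everything: fall back to the most used codon.
            best = max(fractions, key=fractions.get)
            kept = {best: fractions[best]}
        norm = sum(kept.values())
        table[aa] = {c: f / norm for c, f in kept.items()}
    return table

def back_translate(protein, table, rng):
    """Pick one codon per residue, weighted by the renormalized table."""
    codons = []
    for aa in protein:
        choices = list(table[aa])
        weights = [table[aa][c] for c in choices]
        codons.append(rng.choices(choices, weights=weights)[0])
    return "".join(codons)

table = sampling_table(TOY_USAGE, TOY_SYNONYMS)
print(back_translate("MLKLLK", table, random.Random(0)))
```

A CAI-maximizing design would instead always take the single most used codon for each residue, so the two schemes differ only in the final selection step.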


A related strategy worth mentioning is the use of special expression strains which express extra copies of the rare tRNAs.

There is a lot of literature on codon optimization, and most of it suffers from the same flaw. Most papers describe taking one ORF, re-synthesizing it with a particular optimization scheme, and then comparing the two. One problem with this is the small N and the potential for publication bias (do people publish less frequently when this fails to work?). Furthermore, it could well be that the resynthesized design changed something else, and the codon optimization is really unimportant. A few papers deviate from this plan & there has been a hint from the structural genomics community of surveying their data (as they often codon optimized), but systematic studies aren't common.

Now in Science comes the sort of paper that starts to be systematic:

Coding-Sequence Determinants of Gene Expression in Escherichia coli
Grzegorz Kudla, Andrew W. Murray, David Tollervey, and Joshua B. Plotkin
Science 10 April 2009: 255-258.


In short, they generated a library of GFP variants in which the choice among synonymous codons was varied randomly and then expressed these from a standard sort of expression vector in E. coli. The summary of their results is that codon usage didn't correlate with GFP brightness (expression), but that the key factor is avoidance of secondary structure near the beginning of the ORF.

It's a good approach, but the question is how general the result is. Is GFP a special protein in some way? Why do the rare tRNA-expressing strains sometimes help with protein expression? And most importantly, does this apply broadly or is it specific to E. coli and relatives?

This last point is important in the context of certain projects. E. coli and Saccharomyces have their codon preferences, but if you want to see an extreme preference, look at Streptomyces and its kin. These are important producers of antibiotics and other natural product medications, and it turns out that the codon usage table is easy to remember: just use G or C in the 3rd position. In one species I looked at, around 95% of all codons followed that rule.
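The statistic behind that last number is simple to compute. Here is a minimal Python sketch; `gc3_fraction` is a name I've made up, and the example sequence is a toy ORF, not a real Streptomyces gene.

```python
def gc3_fraction(orf):
    """Fraction of codons in an in-frame ORF with G or C in the 3rd position."""
    orf = orf.upper()
    if not orf or len(orf) % 3 != 0:
        raise ValueError("ORF length must be a non-zero multiple of 3")
    third_positions = orf[2::3]  # every third base, starting at codon position 3
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)

# Toy ORF: ATG GCC GAC CTG TAA
print(gc3_fraction("ATGGCCGACCTGTAA"))
```

Run over every annotated ORF in a genome and pooled across codons, this gives the genome-wide figure.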

This has the effect of making the G+C content of the entire ORF quite high, which engenders further problems. High G+C DNA can be difficult to assemble (or amplify) via PCR and it sequences badly. Furthermore, such a limited choice of codons means that anything resembling a repeat at the protein level will create a repeat at the DNA level, and even very short repeats can be problematic for gene synthesis. Long runs of G's can also be problematic for oligonucleotide synthesizers (or so I've been told). From a company's perspective, this is also a problem because customers don't really care about it and don't understand why you price some genes higher than others.

So, would the same strategy work in Streptomyces? If so, one could avoid synthesizing hyper-G+C genes and go with more balanced ones, reducing costs and the time to produce the genes. But someone would need to make the leap and repeat the Kudla et al. strategy in some of these target organisms.

8 comments:

LQ said...

Nice insightful comments. I was hoping for someone related to the field to comment on this paper.

Regarding your questions on the generality of the findings in the Science paper, my opinion is that:
1. GFP is not special in this respect
2. Codon bias is still important though probably in far fewer cases than previously thought; hence the role for those rare tRNAs-expressing strains
3. E. coli is probably not unique, though I suspect the findings will be somewhat different with eukaryotes or even organisms with extreme GC bias like Streptomyces (though I couldn't find data to support my opinion on this one)

The most interesting aspect of the publication is not the contents, but that it is published in Science despite there being no new ideas at all. There are plenty of reports on significantly increased protein expression through reducing the secondary structure around the AUG start codon. In a few papers, the authors performed large-scale experiments generating data sets similar to the present Science paper. And yes, most if not all of the previous reports involved expressing some foreign proteins (not necessarily GFP) using various commonly used lab E. coli strains.

I must admit that it'll take a Science paper to debunk the codon optimization myth, or rather, the exaggerated value of codon optimization in improving protein expression. Still, I wish that the authors had done more to answer the questions that Keith had raised.

P.S. The corresponding author of the paper, JB Plotkin, is also the first author of the famous (or infamous) Nature paper (doi:10.1038/nature02458) on the very suspect methodology of detecting selection using a single genome sequence. I'm not trying to cast any doubts over the accuracy or validity of the Science paper; after all, the findings are well corroborated by plenty of other existing literature :p

Anonymous said...

So, in the grand scheme of things, I guess it also depends on the business model of the gene synthesis company you might be using, and on how well they can synthesize native sequences (or purport to), as to whether they push 'Codon' optimization on you to make it easier for themselves, and thus to be able to say that they built what you asked them to make. Not all gene synthesis companies are created equal, as we now well know. Millions of years of evolution is a pretty good indicator of codon usage in my book, and companies that push 'Codon' optimization (by offering a dramatically lower price) are usually showing their inability to make the tough stuff no matter how much puffing they do in public. I guess market forces will determine in the end the companies that will survive.

Keith Robison said...

Hmm, I'm guessing that last comment came from a Codon Devices customer we failed to leave gruntled. To the extent I was responsible, I do apologize. We never knowingly oversold our abilities to customers, but that doesn't absolve us from failing to deliver on time -- we were certainly guilty of inadvertently overestimating our ability.

That said, I do have a little experience farming work out to surviving competitors, and what we classified as difficult they always classified the same way; at least one time when we tried to farm out some tough stuff, the other company refused to take the order. Everyone in the business finds codon optimization useful for making hard problems easier, thereby delivering to customers more reliably, faster & cheaper.

Millions of years of evolution aren't a bad guide, but once you've moved to a heterologous expression system it's a whole new ball of wax.

Anonymous said...

Nice post!

Also, this paper shows that removing too many pauses (i.e. rare codons) is detrimental to protein solubility.

http://www.microbialcellfactories.com/content/8/1/41

Unknown said...

"Millions of years of evolution aren't a bad guide," but how well do those companies understand the use of codon bias? Do they know how it affects the coupling of transcription and translation? The effect on targeting and folding? To put it in a nutshell: who can guarantee that once you have changed the codons, you will get soluble expression?

Unknown said...

K,

what do you think now?

this following paper showed codon bias:
http://www.ncbi.nlm.nih.gov/pubmed/19759823

and now this one (a bit more ambiguous?):
http://www.ncbi.nlm.nih.gov/pubmed/21516083

Anonymous said...

steff2j:

to be fair one should mention that the first of these papers is from DNA 2.0.