The genetic code (to a first approximation) uses 64 codons to encode 21 different signals (20 amino acids plus stop); hence there is some choice as to which codon to use. In the standard scheme of things, an amino acid or stop can have 1, 2, 3, 4 or 6 codons. But those codons are rarely used with equal frequency. Leucine, for example, has 6 codons, some used often and others rarely. Which codons are preferred and which disfavored, and the degree of the preference, depends on the organism. In the extreme, a codon can fall so far out of favor that it goes extinct and can no longer be used, and sometimes it is later reassigned to something else; hence some of the tidier codes in certain organisms.
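For anyone who wants to check the 64-to-21 arithmetic and the 1, 2, 3, 4 or 6 family sizes, here is a minimal Python sketch. It assumes the standard code written in the usual TCAG ordering (NCBI translation table 1); the variable names are my own choices for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order
# with the first base varying slowest; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its one-letter signal.
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each signal (20 amino acids plus stop)?
codons_per_signal = Counter(CODON_TABLE.values())

# Which codon family sizes occur at all?
degeneracy_classes = sorted(set(codons_per_signal.values()))

if __name__ == "__main__":
    print(len(CODON_TABLE), "codons,", len(codons_per_signal), "signals")
    print("family sizes:", degeneracy_classes)
    leucine = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
    print("leucine codons:", leucine)
```

Swapping in a different amino-acid string is all it takes to look at one of the variant codes.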
A further observation is that the more favored codons correspond to more abundant tRNAs and the less favored ones to less abundant tRNAs. Furthermore, highly expressed genes are often rich in favored codons, while weakly expressed genes are much more likely to use rare ones. To complete the picture, in organisms such as E. coli there are genes which don't seem to follow the usual pattern, and these are often associated with mobile elements and phage, or show other signs that they may be recent acquisitions from another species.
A practical application of this is codon optimization of genes. If you are having a gene built to express a protein in a foreign host, then it would seem apropos to adjust the codon usage to the local dialect, which usually still leaves plenty of room to accommodate other wishes (such as avoiding the recognition sites for specific restriction enzymes). There are at least four major schemes for doing this, with different gene synthesis vendors preferring one or another:
- CAI Maximization. CAI is a measure of usage of preferred codons; this strategy tries to maximize the statistic by using the most preferred codons. Logic: if these are the most preferred codons, and highly expressed genes are rich in them, why not do the same?
- Codon sampling. This strategy (which is what Codon Devices offered) samples from a set of codons with probabilities proportional to their usage in the organism, after first zeroing out the very rare codons and renormalizing the table. Logic: avoid the rare ones, but don't hammer the better ones either; balance is always good.
- Dicodon optimization. In addition to codons showing preferences, there's also a pattern by which adjacent codons pair slightly non-randomly. One particular example: very rare codons are very unlikely to be followed by another very rare codon. Logic: an even better approach to "when in Rome..." than either of the two above.
- Codon frequency matching. Roughly, this means looking at the native mRNA and its use of codons and aping this in the target species; a codon which is rare in the native organism should be replaced with one that is rare in the target. Logic: some rare codons may just help fold things properly.
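The first two schemes are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the usage fractions in `TOY_USAGE` are made up (not a real organism's table), the 10% cutoff for "very rare" is an arbitrary choice, and the function names are mine rather than any vendor's. CAI is computed here as the geometric mean of each codon's usage relative to the most used synonym.

```python
import math
import random

# Made-up usage fractions for three amino acids, for illustration only.
TOY_USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "L": {"CTG": 0.50, "TTA": 0.13, "TTG": 0.13,
          "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
}

def most_preferred(protein, usage):
    """CAI-maximizing back-translation: always take the most used codon."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)

def drop_rare(usage, min_fraction=0.10):
    """Zero out codons below min_fraction, then renormalize each family."""
    trimmed = {}
    for aa, family in usage.items():
        kept = {c: f for c, f in family.items() if f >= min_fraction}
        if not kept:  # never leave an amino acid without any codon
            best = max(family, key=family.get)
            kept = {best: family[best]}
        total = sum(kept.values())
        trimmed[aa] = {c: f / total for c, f in kept.items()}
    return trimmed

def sampled(protein, usage, rng, min_fraction=0.10):
    """Codon sampling: draw each codon in proportion to its trimmed usage."""
    table = drop_rare(usage, min_fraction)
    picks = []
    for aa in protein:
        family = table[aa]
        picks.append(rng.choices(list(family), weights=list(family.values()), k=1)[0])
    return "".join(picks)

def cai(dna, usage):
    """Geometric mean of w = usage / (usage of the best synonymous codon).

    Simplification: single-codon amino acids are included with w = 1;
    published definitions usually leave them out.
    """
    weight = {c: f / max(family.values())
              for family in usage.values() for c, f in family.items()}
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return math.exp(sum(math.log(weight[c]) for c in codons) / len(codons))

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    peptide = "MKLLKL"
    for name, dna in [("max", most_preferred(peptide, TOY_USAGE)),
                      ("sampled", sampled(peptide, TOY_USAGE, rng))]:
        print(name, dna, round(cai(dna, TOY_USAGE), 3))
```

The maximizing design scores a CAI of 1 by construction, while the sampled design lands somewhere below that but never touches the codon that was zeroed out. Dicodon optimization and frequency matching need more data (codon pair counts, or the native organism's table) and are not attempted here.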
A related strategy worth mentioning is the use of special expression strains which express extra copies of the rare tRNAs.
There is a lot of literature on codon optimization, and most of it suffers from the same flaw. Most papers describe taking one ORF, re-synthesizing it with a particular optimization scheme, and then comparing the original and the redesign. One problem with this is the small N and the potential for publication bias (do people publish less frequently when this fails to work?). Furthermore, it could well be that the resynthesized design changed something else, and the codon optimization is really unimportant. A few papers deviate from this plan, and there has been a hint from the structural genomics community of surveying their data (as they often codon optimized), but systematic studies aren't common.
Now in Science comes the sort of paper that starts to be systematic:
Coding-Sequence Determinants of Gene Expression in Escherichia coli
Grzegorz Kudla, Andrew W. Murray, David Tollervey, and Joshua B. Plotkin
Science 10 April 2009: 255-258.
In short, they generated a library of GFP variants in which the choice among synonymous codons was varied randomly, and then expressed these from a standard sort of expression vector in E. coli. The summary of their results is that codon usage didn't correlate with GFP brightness (expression), and that the key factor is avoidance of secondary structure near the beginning of the ORF.
It's a good approach, but the question is how general the result is. Is GFP a special protein in some way? Why do the rare tRNA-expressing strains sometimes help with protein expression? And most importantly, does this apply broadly or is it specific to E. coli and relatives?
This last point is important in the context of certain projects. E. coli and Saccharomyces have their codon preferences, but if you want to see an extreme preference, look at Streptomyces and its kin. These are important producers of antibiotics and other natural product medications, and it turns out that the codon usage table is easy to remember: just use G or C in the 3rd position. In one species I looked at, around 95% of all codons followed that rule.
This has the effect of making the G+C content of the entire ORF quite high, which engenders further problems. High G+C DNA can be difficult to assemble (or amplify) via PCR and it sequences badly. Furthermore, such a limited choice of codons means that anything resembling a repeat at the protein level will create a repeat at the DNA level, and even very short repeats can be problematic for gene synthesis. Long runs of G's can also be problematic for oligonucleotide synthesizers (or so I've been told). From a company's perspective, this is also a problem because customers don't really care about it and don't understand why you price some genes higher than others.
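The third-position rule, the overall G+C content and the longest run of G's are all easy to measure on any ORF. Here is a small Python sketch; the function names are my own, and the example sequence in the usage block is an invented toy ORF, not a real gene.

```python
def split_codons(orf):
    """Split an ORF into codons, ignoring any trailing partial codon."""
    orf = orf.upper()
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]

def gc3_fraction(orf):
    """Fraction of codons with G or C in the third position."""
    third_bases = [codon[2] for codon in split_codons(orf)]
    return sum(base in "GC" for base in third_bases) / len(third_bases)

def gc_fraction(seq):
    """Overall G+C fraction of the sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def longest_run(seq, base="G"):
    """Length of the longest uninterrupted run of one base."""
    longest = current = 0
    for b in seq.upper():
        current = current + 1 if b == base else 0
        longest = max(longest, current)
    return longest

if __name__ == "__main__":
    toy_orf = "ATGGCCGACCTGAAGTAA"  # invented example, not a real gene
    print("GC3:", gc3_fraction(toy_orf))
    print("G+C:", gc_fraction(toy_orf))
    print("longest G run:", longest_run(toy_orf))
```

Running the same three measurements over a candidate design before ordering it gives a rough idea of whether it falls into the hard-to-build category.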
So, would the same strategy work in Streptomyces? If so, one could avoid synthesizing hyper-G+C genes and go with more balanced ones, reducing costs and the time to produce the genes. But someone would need to make the leap and repeat the Kudla et al. strategy in some of these target organisms.