Omics! Omics!: Industrial Protein Production: Further Thoughts

Tuesday, September 15, 2009

A commenter on yesterday's piece about codon optimization asked how critical this is for the typical molecular biologist. I think for the typical bench biologist who is expressing small numbers of distinct proteins each year, perhaps the answer is "more critical than you think, but not project threatening". That is, if you are expressing few proteins, only rarely will you encounter show-stopping expression problems. That said, with enough molecular biologists expressing enough proteins, some of them will have awful problems expressing some protein of critical import.

But, consider another situation: the high-throughput protein production lab. These can be found in many contexts. Perhaps the proteins are in a structural proteomics pipeline or similar large scale structure determination effort. Perhaps the proteins are to feed into high-throughput screens. Perhaps they are themselves the products for customers or are going into a protein array or similar multi-protein product. Or perhaps you are trying to express multiple proteins simultaneously to build some interesting new biological circuit.

Now, in some cases a few proteins expressing poorly isn't a big deal. The numbers for the project have a certain amount of attrition baked in, or for something like structural proteomics you can let some other protein which did express jump ahead in the queue. Even so, the extra time and expense of troubleshooting the problem proteins, which can (as suggested by the commenter) be as simple as running multiple batches or as complex as screening multiple expression systems and strains, must be accounted for. Worse, sometimes the protein will be on a critical path and that extra time messes up someone's project plan. Perhaps the protein is the actual human target of your drug or the critical homolog for a structure study. Another nightmare scenario is that the statistics don't average out; for some project you're faced with a jackpot of poor expressors.

This in the end is the huge advantage of predictability; the rarer the unusual events, the smoother a high-throughput pipeline runs and the more reliable its output. So, from this point of view the advantage of the new codon optimization work is not necessarily that you can get huge amounts of proteins, but rather that the unpredictability is ironed out.
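To put rough numbers on the "jackpot" scenario, here is a minimal sketch of the arithmetic. It assumes each protein fails to express independently with the same probability, which is a simplification, and the failure rates of 20% and 5% are purely hypothetical figures chosen for illustration. The function name prob_at_least is likewise my own.

```python
from math import comb

def prob_at_least(n, k, p):
    """Probability that at least k of n proteins express poorly,
    assuming each fails independently with probability p (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical per-protein failure rates: before and after "ironing out" expression
for p in (0.20, 0.05):
    risk = prob_at_least(10, 5, p)
    print(f"failure rate {p:.2f}: chance that 5+ of 10 proteins fail = {risk:.5f}")
```

Under these assumptions, lowering the per-protein failure rate shrinks the chance of a bad project far faster than it shrinks the average number of failures, which is the predictability argument in numerical form.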

But suppose you wanted to go further? Given the enormous space of useful & interesting proteins to express, there will probably be some that become the outliers to the new process. How could you go further?

One approach would be to further tune the tRNA system of E. coli (or any other expression host). For example, there are already special E. coli strains which express some of the extremely disfavored E. coli tRNAs, and these seem to help expression when you can't codon optimize. In theory, it should be possible to create an E. coli with completely balanced tRNA expression. One approach to this would be to analyze the promoters of the weak tRNAs and try to rev them up, mutagenizing them en masse with the MAGE technology published by the Church lab.
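As a small illustration of why the disfavored tRNAs matter, the sketch below flags codons in a coding sequence that are commonly described as rare in E. coli. The codon set here is an assumption on my part, a short illustrative list rather than a definitive table, and the function name is hypothetical.

```python
# Illustrative set of codons often described as rare in E. coli.
# This is an assumption for the example, not an authoritative usage table.
RARE_CODONS = {"AGA", "AGG", "ATA", "CTA", "CCC", "GGA"}

def rare_codon_positions(cds):
    """Return (codon_index, codon) for each in-frame codon found in RARE_CODONS."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    hits = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        if codon in RARE_CODONS:
            hits.append((i // 3, codon))
    return hits

# Toy sequence: ATG AGA AAA CTA TAA
print(rare_codon_positions("ATGAGAAAACTATAA"))
```

A gene with many hits, especially clustered ones, is the kind of construct that would benefit either from codon optimization or from a host with better-balanced tRNA levels.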

What else could you do? Expression strains carry all sorts of interesting mutations, often in things such as proteases which can chew up your protein product. There are, of course, all sorts of other standard cloning host mutations enhancing the stability of cloned inserts or providing other useful features. Other important modifications include such things as tightly controlled phage RNA polymerases locked into the host genome.

Another approach is the one commercialized by Scarab Genomics, in which large chunks of the E. coli genome have been tossed out. The logic behind this is that many of these deleted regions contain genetic elements which may interfere with stable cloning or gene expression.

One challenge to the protein engineer or expressionist, however, is getting all the features they want in a single host strain. One strain may have desirable features X and Y but another Z. What is really needed is the technology to make any desirable combination of mutations and additions quickly and easily. The MAGE approach is one step in this direction but only addresses making small edits to a region.

One interesting use of MAGE would be to attempt to further optimize E. coli for high-level protein production. The starting point would be a strain which already had some of the desired features. A further set of useful edits would be designed for the MAGE system. For a readout, I think GFP fused to something interesting would do -- but a set of such fusions would need to be ready to go. This is so evolved strains can quickly be counter-screened to assess how general an effect on protein production they have. If some of these tester plasmids had "poor" codon optimization schemes, then this would allow the tRNA improvement scheme described above to be implemented. Furthermore, it would be useful to have some of these tester constructs in compatible plasmid systems, so that two different test proteins (perhaps fused to different color variants of GFP) could be maintained simultaneously. This would be an even better way to initially screen for generality, and would provide the opportunity to perform the mirror-image screen for mutations which degrade foreign protein overexpression.

What would be targeted and how? The MAGE paper shows that ribosome binding sites can be a very productive way to tune expression, and so a simple approach would be for each targeted gene to have some strong RBS and weak RBS mutagenic oligos designed. For proteins thought to be very useful, MAGE oligos to tweak their promoters upwards would also be included. For proteins thought to be deleterious, complete nulls could be included via stop-codon introducing oligos. As for the genes to target, the list could be quite large but would certainly include tRNAs, tRNA synthetases, all of the enzymes involved in the creation or consumption of amino acids, and amino acid transporters. The rpoS gene and its targets, which are involved in the response to starvation, are clear candidates as well. Ideally one would target every gene, but that isn't quite in the scope of feasibility yet.
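To make the stop-codon idea concrete, here is a rough sketch of how such an oligo could be drafted in software. It only swaps one codon for a stop and cuts out a window around it. The 90-base default length, the choice of TAA, and the function name are my assumptions, and a real design would also need to consider which strand to target, secondary structure and any chemical modifications, none of which is handled here.

```python
def stop_codon_oligo(gene, codon_index, oligo_len=90, stop="TAA"):
    """Replace the codon at codon_index with a stop codon and return a window
    of up to oligo_len bases roughly centred on the change.

    Simplified sketch: ignores strand choice, folding and oligo chemistry.
    """
    gene = gene.upper()
    start = codon_index * 3
    if codon_index < 0 or start + 3 > len(gene):
        raise ValueError("codon_index is outside the gene")
    mutated = gene[:start] + stop + gene[start + 3:]
    centre = start + 1  # middle base of the replaced codon
    left = max(0, centre - oligo_len // 2)
    right = min(len(mutated), left + oligo_len)
    left = max(0, right - oligo_len)  # shift window back if we hit the gene end
    return mutated[left:right]

# Toy gene: start codon, 40 alanine codons, stop codon
toy_gene = "ATG" + "GCT" * 40 + "TAA"
oligo = stop_codon_oligo(toy_gene, codon_index=20)
print(len(oligo), oligo)
```

The same windowing logic would apply to RBS or promoter tweaks; only the substituted bases would differ.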

The screen then is to mutagenize via MAGE and select either dual-high (both reporters enhanced in brightness) or dual-low expressors (both reduced in brightness) by cell sorting. After secondary screens, the evolved strains would be fully sequenced to identify the mutations introduced both by design and by chance. Dual-high screens would pull out mutations that enhance expression whereas dual-low would pull out the opposite. Ideally these would be complementary -- genes knocked down in one would have enhancing mutations in the other.
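The sorting logic can be sketched in a few lines. This is only an illustration of the dual-high / dual-low gating idea: the twofold threshold, the reference brightness values and all names are hypothetical, and a real sorter would gate on instrument-specific distributions rather than fixed cutoffs.

```python
def classify_cells(cells, ref_a, ref_b, fold=2.0):
    """Bin cells by two reporter signals relative to the parent strain.

    cells: iterable of (cell_id, signal_a, signal_b)
    ref_a, ref_b: typical parent-strain brightness for each reporter (assumed known)
    fold: hypothetical fold-change needed to count as enhanced or reduced
    """
    gates = {"dual_high": [], "dual_low": [], "other": []}
    for cell_id, a, b in cells:
        if a >= ref_a * fold and b >= ref_b * fold:
            gates["dual_high"].append(cell_id)   # both reporters brighter
        elif a <= ref_a / fold and b <= ref_b / fold:
            gates["dual_low"].append(cell_id)    # both reporters dimmer
        else:
            gates["other"].append(cell_id)       # unchanged or reporter-specific
    return gates

# Made-up readings for four cells, parent strain at 100 units on both channels
readings = [("c1", 250, 300), ("c2", 40, 30), ("c3", 260, 90), ("c4", 110, 95)]
print(classify_cells(readings, ref_a=100, ref_b=100))
```

Cells like "c3", bright on one reporter only, are exactly the protein-specific effects that using two reporters is meant to filter out.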

Some of the mutations, particularly spontaneous ones, might be "trivial" in that they simply affect copy number of the expression plasmid. However, even these might be new insights into E. coli biology. And if multiple strains emerged with distinct mutations, a new round of MAGE could be used to attempt to combine them and determine if there are additive effects (or interferences).