A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery
Showing posts with label synthetic biology. Show all posts
Wednesday, February 23, 2011
Forbes has an article, co-written by Matthew Herper and Robert Langreth and titled "The Next Big Move For The Smartest Biotech Investor", profiling Randal Kirk. Kirk is described as one of the few billionaires who can attribute that status to biotechnology. Kirk made his money through two companies in the psychiatric drug space: New River Pharmaceuticals developed an ADHD drug (lisdexamfetamine) and was then acquired by Shire, whereas Clinical Data developed an antidepressant (vilazodone) and was then purchased by Forest. A key part of the article profiles Kirk's investment in a little-known and secretive synthetic biology company called "Intrexon".
The title is probably meant to gall; it certainly raises my hackles. The most obvious quibble is that it isn't clear either of the drug development companies was really biotech. Of course, that would require defining biotech, but ideally the term would refer to companies which depend heavily on recombinant DNA and related technologies. Now, such technologies are embedded in virtually all drug development today, but neither of these drugs sounds like it relied on them much. Both drugs are interesting twists on prior approaches (though I'm not enough of a chemist to judge the novelty of vilazodone).
Sunday, February 20, 2011
Will Cheap Gene Synthesis Squelch Cheaper Gene Synthesis?
Among the vast pile of items which I've meant to write about but which have slipped are a paper from last year on gene synthesis and some subsequent announcements about trying to commercialize the method described in that paper. This is an area in which I have past experience, though I would never claim that gives me indisputable authority or omniscience in the matter.
The paper, primarily by scientists from Febit but also with two scientists from Stanford and George Church in the author list, finally describes an interesting approach to dealing with some serious challenges in gene synthesis which substantially increase the costs. By finally, I mean that the idea has certainly been kicking around for a while and was mentioned when I visited Codon Devices in the fall of 2006 looking for employment.
To fill in some background first, gene synthesis is a powerful way to generate DNA constructs which can enable all sorts of experiments. The challenge is that the cost of gene synthesis, currently starting at around $0.40 per base pair for very easy and short stuff (say, less than 2Kb), tends to restrict what you can use it for. I have a project concept right now that would be a slam dunk for gene synthesis -- but not at $0.40/bp (which I think I couldn't even get for the project). Whack that price by a few factors of two and the project becomes reasonable.
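To make the arithmetic concrete, here is a trivial back-of-the-envelope calculation; the 50 kb project size is purely a hypothetical stand-in, with only the per-base prices taken from the discussion above.

```python
# Back-of-the-envelope gene synthesis project costs.
# The 50 kb project size is a made-up example; prices are dollars per base pair.
project_bp = 50_000

for price_per_bp in (0.40, 0.20, 0.10, 0.08):
    total = project_bp * price_per_bp
    print(f"${price_per_bp:.2f}/bp -> ${total:,.0f} for {project_bp:,} bp")
```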
There are many cost components to commercial gene synthesis, and only someone who has carefully looked over the books while wearing a green eyeshade is going to have a proper handle on them. But three of the big expenses are the oligos themselves, the sequencing of constructs to find the correct ones and labor. What the Febit paper does is illustrate a nice way to tackle the first two in a manner that shouldn't require a lot of labor.
The oligo cost is a serious issue. Conventional oligos can be had for around $0.08 a base, or maybe a bit less. However, each base in the final construct requires close to 2 bases in the oligo set; some design strategies might get this down a bit. Conventional columns also generate far more oligo than you actually need. An approach which has been published (but not commercialized as far as I know) is to scale down the synthesis using microfluidics. This better matches the amount synthesized to the amount you need, though the length and quality of the oligos need refinement beyond what was reported in order to be truly useful. Microarrays are a means to synthesize huge numbers of oligos, but their quality also tends to be low and the quantity of each oligo species is much too small without further amplification. Amplification schemes have been worked out, but they add to the processing costs of the oligos.
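A rough sketch of why that oligo line item matters, using the approximate figures above (about two oligo bases per finished base pair at roughly $0.08 per oligo base); the exact ratio will vary with the assembly design.

```python
# Rough share of a $0.40/bp sale price consumed by conventional oligos,
# assuming roughly 2 oligo bases are needed per finished base pair.
oligo_cost_per_base = 0.08   # dollars per oligo base (approximate)
oligo_bases_per_bp = 2.0     # oligo bases consumed per finished bp (approximate)
sale_price_per_bp = 0.40     # dollars per finished bp

oligo_cost_per_bp = oligo_cost_per_base * oligo_bases_per_bp
print(f"Oligo cost per finished bp: ${oligo_cost_per_bp:.2f}")
print(f"Fraction of the sale price: {oligo_cost_per_bp / sale_price_per_bp:.0%}")
```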
What Febit and company have done is take those microarray-built oligos and screen them using 454 sequencing. The beads carrying amplicons with correct oligos are then plucked out of the 454 flowcell (with a 90% success rate of getting the right bead) and used as starting material.
Now, this has several interesting angles. First, it has been challenging to marry the non-Sanger new sequencing technologies to gene synthesis. The new technologies tend to have short reads, too short to read even a short construct. The new technologies also require library construction and it is difficult to trace a given sequence back to a specific input DNA. In other words, short read technologies are great at reading populations, but not individual wells in a gene synthesis output. Sanger on the other hand, is ill-suited for populations but great for individual clones. One solution to this problem is clever pooling and barcoding strategies, but these necessitate having enough different clones to be worth pooling and barcoding. In other words, second generation sequencing is difficult to adapt to retail gene synthesis, but looks practical for wholesale gene synthesis.
Getting the oligos right has important positive side-effects. While the stitching together of oligos into larger fragments (and larger fragments into still larger ones) can generate errors, an awful lot of the problems stem from bad input oligos. Not only can error rates be troublesome, but some of the erroneous sequences may have advantages over the correct ones in later steps. For example, a deleted fragment may PCR more efficiently than the full-length one, and slightly toxic gene products may be disfavored in cloning steps relative to frameshifted versions of the same reading frame. So by putting the sequencing up front, it should be possible to greatly reduce the sequencing needed downstream; even if that downstream sequencing remains Sanger, a lot less of it should be required.
Okay, that's the science. Now some worries about the business. Febit announced in January that they are looking for investors to launch a new company to commercialize this approach. This makes good business sense, since Febit itself must be encrusted with all sorts of business barnacles, having lurched from one business to another in trying to commercialize their microfluidic microarray system. Previous failed attempts include gene synthesis as well as microarray expression analysis and hybridization capture (I even ran one experiment with their system, whose results certainly didn't argue for them staying in that business!). The press release stated they were hoping to attain pricing in the $0.08 per base range, which would make my current experiment concept feasible. That would be great.
Now, they will need to refine their system and perhaps adapt it to other sequencers. A 454 Jr would probably not be a difficult adaptation, but moving on to Ion Torrent must be tempting. Getting things to work for one paper and one set of genes is unfortunately quite different from keeping things working over an entire spectrum of customer designs.
Which leads me to where I think they will face their greatest challenge, though one which I think can be finessed with the proper business approach. They will be bringing to market a methodology whose benefit is cost, but with the caveat that the cost advantage is attained only at sufficient volume. Initially, they will also be unable to reliably predict delivery times (as kinks show up). Finally, they are adding some extra processing steps (454 sequencing, bead recovery and oligo recovery from the beads) which may add to the turnaround time.
The abyss into which this new company must plunge is a world in which very fast gene synthesis is available from a large number of vendors in the $0.40 price range. So, they must find very large customers who are willing to be a bit patient and keep their pipeline filled. Such customers do exist, but they aren't always easy to find and pry away from their existing suppliers. In theory much cheaper synthesis would unleash new orders for projects (such as mine) which are too costly at current prices, but that is always a risky assumption to bank a company on (cf. Codon Devices' gene synthesis business).
It's the alternative route that I predict this NewCo is likely to go down: link up with an established provider in the field. Said provider, through their salespeople and sales software, could offer each customer an option -- I can build your genes for $0.40 a base if you want them fast, or hack that down to $0.10 a base if you can wait. To preserve customer satisfaction, that long lead time would need to include an insurance policy of building the genes by the conventional route if the new route fails -- but of course if you are frequently forced to build $0.40/bp genes for which you charged $0.10/bp, that would be financial suicide.
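A toy model of that insurance problem, with entirely invented internal cost figures; the only numbers taken from the discussion above are the $0.40 and $0.10 per-base price points.

```python
# Toy margin model for a "cheap but slow" gene synthesis offering, assuming
# (hypothetically) that every failed cheap build is remade by the conventional
# route at the provider's expense. All internal cost figures are invented.
price_cheap = 0.10        # dollars per bp charged to the customer
cost_cheap = 0.05         # hypothetical internal cost of the new route, $/bp
cost_conventional = 0.25  # hypothetical internal cost of a conventional rebuild, $/bp

for failure_rate in (0.0, 0.1, 0.2, 0.3):
    expected_cost = cost_cheap + failure_rate * cost_conventional
    margin = price_cheap - expected_cost
    print(f"failure rate {failure_rate:.0%}: expected margin {margin:+.3f} $/bp")
```

Under these made-up numbers, a 20% fallback rate already erases the margin, which is the crux of the worry above.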
So, in summary, I think this is a clever idea which needs to be pushed forward. But, after a long gestation in the lab, it faces a very rocky future in the production world. I hope they succeed, because it is not hard to imagine projects I would like to do which would be enabled by such a capability.
Thursday, May 20, 2010
The New Genome on the Block
The world is abuzz with the announcement by Craig Venter and colleagues that they have successfully booted up a synthetic bacterial genome.
I need to really read the paper but I have skimmed it and spotted a few things. For example, this is a really impressive feat of gene synthesis but even so a mutation slipped in which went unnoticed until one version was tested. Even bugs need debuggers!
It is also a small but important step. Describing it as a man-made organism is in some ways true and in some ways not. In particular, any die-hard vitalists (which nobody will admit to being, though there is clearly a huge number of health food products sold using vitalist claims) will point out that there was never a time when there wasn't a living cell -- the new genome was started up within an old one.
It is fun to speculate about possible next directions. For example, they booted a new Mycoplasma genome within another Mycoplasma cell -- different species, but very similar to the host. Clearly one research direction will be to try to create increasingly different genomes. A related one is to try to bolt on entire new subsystems. A Japanese group tried fusing B.subtilis (a heavily studied soil bug) with a cyanobacterium to see if they could build a hybrid which retained the photosynthetic capabilities of the cyano; alas they got only sickly hybrids that didn't do much of interest. Could you add in photosynthesis to the new bug? Or a bacterial flagellum? Or some other really complex more-than-just-coupled-enzymes subsystem?
But as someone with a computer background -- and someone who has thought off-and-on about this topic since graduate school (mostly off, to be honest) -- to me a really interesting demonstration would be a dual-boot genome. Again, in this case the two bacterial species were very similar, so their major operational signals are the same. Consider two of the most important systems which do vary widely from bacterial clade to clade (the genetic code is, of course, nearly universal -- though Mycoplasma do have an idiosyncratic variation on the code): promoters and ribosome binding sites. Could you build the second genome to use a completely incompatible set of one of these (later both) and successfully boot it? Clearly what you would need is for the host genome -- or an auxiliary plasmid -- to supply the necessary factors. Probably the easier route would be to have the synthetic genome use the ribosomal signals of the host but a different promoter scheme. In theory just expressing the sigma factor for those promoters would be sufficient -- but would it be? To me this would be a fascinating exercise!
Now, I did claim dual-boot. A true dual-boot system could use both. That is much trickier, but particularly on the transcriptional side it is somewhat plausible -- just arrange the two promoters in tandem. Ribosome binding sites would need to be hybrids, which isn't as striking a change.
There are even more outlandish proposals floating out there -- synthetic bugs with very different genetic codes (perhaps even non-triplet codes) or the ultimate synthetic beast -- one with the reverse handedness to all its chiral molecules. Those are clearly a long ways off, but today's announcement is another step in these directions.
Sunday, January 10, 2010
There's Plenty of Room at the Bottom
Friday's Wall Street Journal had a piece in the back opinion section (which has items about culture & religion and similar stuff) discussing Richard Feynman's famous 1959 talk "There's Plenty of Room at the Bottom". This talk is frequently cited as a seminal moment -- perhaps the first proposition -- of nanotechnology. But it turns out that, when surveyed, many practitioners in the field claim not to have been influenced by it and often never to have read it. The article pretty much concludes that Feynman's role in the field is mostly promoted by those who promote the field and extreme visions of it.
Now, by coincidence I'm in the middle of a Feynman kick. I first encountered him in the summer of 1985, when his "as told to" book "Surely You're Joking Mr. Feynman" was my hammock reading. The next year he would become a truly national figure with his carefully planned science demonstration as part of the Challenger disaster commission. More recently I watched Infinity, which focuses on his doomed marriage (his wife would die of TB) and the Manhattan Project. Somehow, that pushed me to finally read James Gleick's biography "Genius", and now I'm crunching through "Six Easy Pieces" (a book based largely on Feynman's famous physics lecture set for undergraduates), with the actual lectures checked out as well for stuffing on my audio player. I'll burn out soon (this is a common pattern), but will gain much from it.
I had never actually read the talk before, just summaries in the various books, but luckily it is available on-line -- and makes great reading. Feynman gave the talk at the American Physical Society meeting, and apparently nobody knew what he would say -- some thought the talk would be about the physics job market! Instead, he sketched out a lot of crazy ideas that nobody had proposed before -- how small a machine could one build? How tiny could you write? Could you make small machines which could make even smaller machines and so on and so forth? He even put up two $1000 prizes:
It is my intention to offer a prize of $1,000 to the first guy who can take the information on the page of a book and put it on an area 1/25,000 smaller in linear scale in such manner that it can be read by an electron microscope.
And I want to offer another prize---if I can figure out how to phrase it so that I don't get into a mess of arguments about definitions---of another $1,000 to the first guy who makes an operating electric motor---a rotating electric motor which can be controlled from the outside and, not counting the lead-in wires, is only 1/64 inch cube.
The first prize wasn't claimed until the 1980's, but a string of cranks streamed in to claim the second one -- bringing in various toy motors. Gleick describes Feynman's eyes as "glazed over" when yet another person came in to claim the motor prize -- and an "uh oh" when the guy pulled out a microscope. It turned out that by very patient work it was possible to use very conventional technology to wind a motor that small -- and Feynman hadn't actually set aside money for the prize!
Feynman's relationship to nanotechnology is reminiscent of Mendel's to genetics. Mendel did amazing work, decades ahead of his time. He documented things carefully, but his publication strategy (a combination of obscure regional journals and sending his works to various libraries & famous scientists) failed in his lifetime. Only after three different groups rediscovered his work -- after finding much the same results -- was Mendel started on the road to scientific iconhood. Clearly, Mendel did not influence those who rediscovered him and if his work were still buried in rare book rooms, we would have a similar understanding of genetics to what we have today. Yet, we refer to genetics as "Mendelian" (and "non-Mendelian").
I hope nanotechnologists give Feynman a similar respect. Perhaps some of the terms describing his role are hyperbole ("spiritual founder"), but he clearly articulated both some of the challenges that would be encountered (for example, that issues of lubrication & friction at these scales would be quite different) and why we needed to address them. For example, he pointed out that the computer technology of the day (vacuum tubes) would place inherent performance limits on computers -- simply because the speed of light would limit the speed of information transfer across a macroscopic computer complex. He also pointed out that the then-current transistor technology looked like a dead end, as the entire world's supply of germanium would be insufficient. But, unlike naysayers he pointed out that these were problems to solve, and that he didn't know if they really would be problems.
One last thought -- many of the proponents of synthetic biology point out that biology has come up with wonderfully compact machines that we should either copy or harness. And who first articulated this concept? I don't know for sure, but I now propose that 1959 is the year to beat:
The biological example of writing information on a small scale has inspired me to think of something that should be possible. Biology is not simply writing information; it is doing something about it. A biological system can be exceedingly small. Many of the cells are very tiny, but they are very active; they manufacture various substances; they walk around; they wiggle; and they do all kinds of marvelous things---all on a very small scale. Also, they store information. Consider the possibility that we too can make a thing very small which does what we want---that we can manufacture an object that maneuvers at that level!
So if the nanotechnologists don't want to call their field Feynmanian, I propose that synthetic biology be renamed such!
Tuesday, September 15, 2009
Industrial Protein Production: Further Thoughts
A question raised by a commenter on yesterday's piece about codon optimization is how critical this is for the typical molecular biologist. I think for the typical bench biologist, who is expressing small numbers of distinct proteins each year, the answer is perhaps "more critical than you think, but not project threatening". That is, if you are expressing few proteins, only rarely will you encounter show-stopping expression problems. That said, with enough molecular biologists expressing enough proteins, some of them will have awful problems expressing some protein of critical import.
But, consider another situation: the high-throughput protein production lab. These can be found in many contexts. Perhaps the proteins are in a structural proteomics pipeline or similar large scale structure determination effort. Perhaps the proteins are to feed into high-throughput screens. Perhaps they are themselves the products for customers or are going into a protein array or similar multi-protein product. Or perhaps you are trying to express multiple proteins simultaneously to build some interesting new biological circuit.
Now, in some cases a few proteins expressing poorly isn't a big deal. The numbers for the project have a certain amount of attrition baked in, or for something like structural proteomics you can let some other protein which did express jump ahead in the queue. Even so, the extra time and expense of troubleshooting the problem proteins -- which can (as suggested by the commenter) be as simple as running multiple batches or as complex as screening multiple expression systems and strains -- is time and effort that must be accounted for. Sometimes, though, the protein will be on a critical path and that extra time messes up someone's project plan. Perhaps the protein is the actual human target of your drug or the critical homolog for a structure study. Another nightmare scenario is that the statistics don't average out; for some project you're faced with a jackpot of poor expressors.
This in the end is the huge advantage of predictability; the rarer the unusual events, the smoother a high-throughput pipeline runs and the more reliable its output. So, from this point of view the advantage of the new codon optimization work is not necessarily that you can get huge amounts of proteins, but rather that the unpredictability is ironed out.
But suppose you wanted to go further? Given the enormous space of useful & interesting proteins to express, there will probably be some that become the outliers to the new process. How could you go further?
One approach would be to further tune the tRNA system of E.coli (or any other expression host). For example, there are already special E.coli strains which express some of the extremely disfavored E.coli tRNAs, and these seem to help expression when you can't codon optimize. In theory, it should be possible to create an E.coli with completely balanced tRNA expression. One approach to this would be to analyze the promoters of the weak tRNAs and try to rev them up, mutagenizing them en masse with the MAGE technology published by the Church lab.
What else could you do? Expression strains carry all sorts of interesting mutations, often in things such as proteases which can chew up your protein product. There are, of course, all sorts of other standard cloning host mutations enhancing the stability of cloned inserts or providing useful other features. Other important modifications include such things as tightly controlled phage RNA polymerases locked into the host genome.
Another approach is the one commercialized by Scarab Genomics, in which large chunks of the E.coli genome have been tossed out. The logic is that many of these deleted regions contain genetic elements which may interfere with stable cloning or gene expression.
One challenge to the protein engineer or expressionist, however, is getting all the features they want in a single host strain. One strain may have desirable features X and Y but another Z. What is really needed is the technology to make any desirable combination of mutations and additions quickly and easily. The MAGE approach is one step in this direction but only addresses making small edits to a region.
One interesting use of MAGE would be to attempt to further optimize E.coli for high-level protein production. One approach would be to design a strain which already had some of the desired features. A further set of useful edits would be designed for the MAGE system. For a readout, I think GFP fused to something interesting would do -- but a set of such fusions would need to be ready to go. This is so evolved strains can quickly be counter-screened to assess how general an effect on protein production they have. If some of these tester plasmids had "poor" codon optimization schemes, then this would allow the tRNA improvement scheme described above to be implemented. Furthermore, it would be useful to have some of these tester constructs in compatible plasmid systems, so that two different test proteins (perhaps fused to different color variants of GFP) could be maintained simultaneously. This would be an even better way to initially screen for generality, and would provide the opportunity to perform the mirror-image screen for mutations which degrade foreign protein overexpression.
What would be targeted and how? The MAGE paper shows that ribosome binding sites can be a very productive way to tune expression, so a simple approach would be to design strong-RBS and weak-RBS mutagenic oligos for each targeted gene. For proteins thought to be very useful, MAGE oligos to tweak their promoters upwards would also be included. For proteins thought to be deleterious, complete nulls could be included via stop-codon-introducing oligos. As for the genes to target, the list could be quite large but would certainly include tRNAs, tRNA synthetases, all of the enzymes involved in the creation or consumption of amino acids, and amino acid transporters. The RpoS gene and its targets, which are involved in the response to starvation, are clear candidates as well. Ideally one would target every gene, but that isn't quite within the scope of feasibility yet.
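As a very rough sketch of what the RBS-randomizing part of that oligo design might look like -- the sequences, window coordinates and the simple N-randomization below are all hypothetical placeholders, not actual MAGE design rules.

```python
# Hypothetical sketch: generate a MAGE-style oligo that randomizes the
# putative ribosome binding site window upstream of a target gene.
# The sequences, window coordinates and use of fully degenerate N bases
# are placeholder assumptions for illustration only.

def degenerate_rbs_oligo(upstream, orf_start, rbs_window=(-14, -8)):
    """Return an oligo with the putative RBS window replaced by degenerate
    bases (N), flanked by unchanged homology arms."""
    region = upstream + orf_start
    start = len(upstream)                      # position of the start codon
    lo, hi = start + rbs_window[0], start + rbs_window[1]
    return region[:lo] + "N" * (hi - lo) + region[hi:]

# Toy example: invented upstream sequence plus the start of an invented ORF.
upstream = "TTGACAATTAATCATCGGCTCGTATAATGTGTGGAATTGTGAGCG"
orf_start = "ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTCCTGA"
print(degenerate_rbs_oligo(upstream, orf_start))
```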
The screen then is to mutagenize via MAGE and select either dual-high (both reporters enhanced in brightness) or dual-low expressors (both reduced in brightness) by cell sorting. After secondary screens, the evolved strains would be fully sequenced to identify the mutations introduced both by design and by chance. Dual-high screens would pull out mutations that enhance expression whereas dual-low would pull out the opposite. Ideally these would be complementary -- genes knocked down in one would have enhancing mutations in the other.
Some of the mutations, particularly spontaneous ones, might be "trivial" in that they simply affect copy number of the expression plasmid. However, even these might be new insights into E.coli biology. And if multiple strains emerged with distinct mutations, a new round of MAGE could be used to attempt to combine them and determine if there are additive effects (or interferences).
Monday, September 14, 2009
Codon Optimization is Not Bunk?
In a previous post I asked "Is Codon Optimization Bunk?", reflecting on a paper which showed that the typical rules for codon optimization appeared not to be highly predictive of the expression of GFP constructs. A paper released in PLoS One sheds new light on this question.
A quick review. To a first approximation, the genetic code consists of 3-nucleotide units called codons; there are 64 possible codons. Twenty amino acids plus stop are specified by these codons (again, a first approximation). So either a lot of codons are never used, or at least some codons mean the same thing. In the coding system used by the vast majority of organisms, two amino acids are encoded by a single codon whereas all the others have 2, 3, 4 or 6 codons apiece (and stop gets 3). For amino acids with 2, 3 or 4 codons, it is the third position that makes the difference; the three that have 6 codons have one block of 4 which follows this pattern and one block of 2 whose members also differ from each other in the third position. For two of the amino acids with 6 codons, the two blocks are next to each other, so that you can think of the difference between the blocks as a change in the first position; Ser is very strange in that its two blocks of codons are not terribly like each other. For amino acids with two codons, the third position is either a purine (A,G) or a pyrimidine (C,T). For a given amino acid, these codons are not used equally by a given organism; the pattern of bias in codon usage is quite distinctive for an organism and its close cousins, and this has wide effects on the genome (and vice versa). For example, for some Streptomyces I have the codon bias pattern pretty much memorized: use G or C in the third position and you'll almost always pick a frequent codon; use A or T and you've picked a rarity. Some other organisms skew much the other way; they like A or T in the third position.
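For anyone who wants to see such bias in concrete terms, here is a minimal way to tabulate codon frequencies from a coding sequence; the toy CDS is invented, chosen simply so that one leucine codon dominates.

```python
from collections import Counter

def codon_usage(cds):
    """Return each codon's frequency in a coding sequence (length must be a multiple of 3)."""
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    total = len(codons)
    return {codon: n / total for codon, n in Counter(codons).most_common()}

# Invented toy CDS: leucine appears six times, five of them as CTG.
cds = "ATGCTGCTGTTACTGCTGCTGAAAAAGTAA"
for codon, freq in codon_usage(cds).items():
    print(f"{codon}: {freq:.2f}")
```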
Furthermore, within a species the genes can even be divided further. In E.coli, for example, there are roughly three classes of genes each with a distinctive codon usage signature. One class is rich in important proteins which the cell probably needs a lot of, the second class seems to have many proteins which see only a soupcon of expression and the third class is rich in proteins likely to have been recently acquired from other species.
So it was natural to infer that this mattered for protein expression, in particular if you try to express a protein from one species in another. Some species seemed to care more than others. E.coli has a reputation for being finicky and had one of the best-studied systems. Not only did changing the codon usage over to a more E.coli-like scheme seem to help some proteins, but truly rare codons (used less than 5% of the time, though that is an arbitrary threshold) could cause all sorts of trouble.
However, the question remained how to optimize. Given all those interchangeable codons, a synthetic gene could have many permutations. Several major camps emerged, with many variants, particularly amongst the gene synthesis companies. One school of thought said "maximize, maximize, maximize" -- pick the most frequently used codons in the target species. A second school said "context matters" -- and went to maximize the codon pair usage. A third school said "match the source!", meaning make the codon usage of the new coding sequence in the new species resemble the codon usage of the old coding region in the old species; this hedged against possible requirements for rare codons to ensure proper folding. Yet another school (which I belonged to) urged "balance", and chose to make the new coding region resemble a "typical" target-species gene by sampling the codons based on their frequencies, throwing out the truly rare ones. A logic here is that hammering the same codon -- and thereby the same tRNA -- over and over would make that codon as good as rare.
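To make the "balance" school's basic idea concrete (this is a generic sketch, not any vendor's actual algorithm), here is a back-translation that samples codons in proportion to a usage table after discarding rare ones; the usage frequencies shown are made up for illustration.

```python
import random

# Made-up usage frequencies for two amino acids, purely for illustration;
# a real implementation would use a full table measured for the target organism.
CODON_USAGE = {
    "L": {"CTG": 0.50, "TTA": 0.13, "TTG": 0.13, "CTC": 0.10, "CTT": 0.10, "CTA": 0.04},
    "K": {"AAA": 0.76, "AAG": 0.24},
}

def sample_codon(aa, min_freq=0.05):
    """Pick a codon for amino acid `aa` in proportion to its usage frequency,
    after throwing out codons rarer than `min_freq`."""
    usage = {c: f for c, f in CODON_USAGE[aa].items() if f >= min_freq}
    codons, weights = zip(*usage.items())
    return random.choices(codons, weights=weights, k=1)[0]

def back_translate(protein):
    return "".join(sample_codon(aa) for aa in protein)

random.seed(0)
print(back_translate("LLKKL"))  # one sampled back-translation of a toy peptide
```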
The new work offers a few crumbs for each of these camps, but only a few; it suggests much was wrong with each -- or perhaps that the same thing was wrong with each. The problem is that even with these systems some proteins just didn't express well, leaving everyone scratching their heads. The GFP work seemed to suggest that the effects of codon usage were unpredictable if present at all, and that in any case other factors, such as secondary structure near the ribosome, were what counted.
What the new work did is synthesize a modest number (40) of versions of two very different proteins (a single-chain antibody and an enzyme), each version specifying the same protein sequence but with a different set of codons. Within each type of protein, the expression varied over two logs; clearly something matters. Furthermore, they divided some of the best and worst expressors into thirds and made chimaeras, head of good and tail of bad (and vice versa). Some chimaeras seemed to have expression resembling their head-end parent, but others seemed to inherit from the tail-end parent. So the GFP-based "ribosome binding site neighborhood secondary structure matters" hypothesis did not fare well in these tests.
After some computational slicing-and-dicing, what they did come up with is that codon usage matters. The twist is that it isn't matching the best used codons (CAI) that's important, as shown in the figure at the top which I'm fair-using. The codons that matter aren't necessarily the most used codons, but when cross-referenced with some data on which codons are most sensitive to starvation conditions the jackpot lights come on. When you use these as your guide, as shown below, the predictive ability is quite striking. In retrospect, this makes total sense: expressing a single protein at very high levels is probably going to deplete a number of amino acids. Indeed, this was the logic of the sampling approach. But, I don't believe any proponent of that approach ever predicted this.
Furthermore, not only does this work on the training set but new coding regions were prepared to test the model, and these new versions had expression levels consistent with the new model.
What of secondary structure near the ribosome? In some of the single-chain antibody constructs an effect could be seen, but it appears the codon usage effect is dominant. In conversations with the authors (more on this below), they mentioned that GFP is easy to code with secondary structure near the ribosome binding site; this is just an interesting interaction of the genetic code with the amino acid sequence of GFP. Since it is easy in this case to stumble on secondary structure, that effect shows up in that dataset.
This is all very interesting, but it is also practical. On the pure biology side, it does suggest that studying starvation is applicable to studying high level protein expression, which should enable further studies on this important problem. On the protein expression side, it suggests a new approach to optimizing expression of synthetic constructs. A catch however: this work was run by DNA2.0 and they have filed for patents and at least some of these patents have issued (e.g. US 7561972 and US 7561973). I mention this only to note that it is so and to give some starting points for reading further; clearly I have neither the expertise nor responsibility to interpret the legal meaning of patents.
Which brings us to one final note: this paper represents my first embargo! A representative of DNA2.0 contacted me back when my "bunk" post was written to mention that this work was going to emerge, and finally last week the curtain was lifted. Obviously they know how to keep a geek in suspense! They sent me the manuscript and engaged in a teleconference with the only proviso being that I continued to keep silent until the paper issued. I'm not sure I would have caught this paper otherwise, so I'm glad they alerted me; though clearly both the paper and this post are not bad press for DNA2.0. Good luck to them! Now that I'm on the other side of the fence, I'll buy my synthetic genes from anyone with a good price and a good design rationale.
Tuesday, April 21, 2009
Is Codon Optimization Bunk?
There is a very interesting paper in Science from a week ago which hearkens back to my gene synthesis days at Codon. But first, some background.
The genetic code (at first approximation) uses 64 codons to encode 21 different signals; hence there are some choices as to which codon to use. Amino acids and stop can have 1,2,3,4 or 6 codons in the standard scheme of things. But, those codons are rarely used with equal frequency. Leucine, for example, has 6 codons and some are rarely used and others often. Which codons are preferred and disfavored, and the degree to which this is true, depends on the organism. In the extreme, a codon can actually go so out of favor it goes extinct & can no longer be used, and sometimes it is later reassigned to something else; hence some of the more tidy codes in certain organisms.
A further observation is that the more favored codons correspond to more abundant tRNAs and less favored ones to less abundant tRNAs. Furthermore, highly expressed genes are often rich in favored codons and lowly expressed ones much more likely to use rare ones. To complete the picture, in organisms such as E.coli there are genes which don't seem to follow the usual pattern -- and these are often associated with mobile elements and phage or have other suggestions that they may be recent acquisitions from another species.
A practical application of this is to codon optimize genes. If you are having a gene built to express a protein in a foreign host, then it would seem apropos to adjust the codon usage to the local dialect, which usually still leaves plenty of room to accommodate other wishes (such as avoiding the recognition sites for specific restriction enzymes). There are at least four major schemes for doing this, with different gene synthesis vendors preferring one or another.
A related strategy worth mentioning are special expression strains which express extra copies of the rare tRNAs.
There is a lot of literature on codon optimization, and most of it suffers from the same flaw. Most papers describe taking one ORF, re-synthesizing it with a particular optimization scheme, and then comparing the two. One problem with this is the small N and the potential for publication bias (do people publish less frequently when this fails to work?). Furthermore, it could well be that the resynthesized design changed something else, and the codon optimization is really unimportant. A few papers deviate from this plan & there has been a hint from the structural genomics community of surveying their data (as they often codon optimized), but systematic studies aren't common.
Now in Science comes the sort of paper that starts to be systematic
In short, they generated a library of GFP variants in which the particular codon used was varied randomly and then expressed these from a standard sort of expression vector in E.coli. The summary of their results is that codon usage didn't correlate with GFP brightness (expression), but that the key factor is avoidance of secondary structure near the beginning of the ORF.
It's a good approach, but a question is how general is the result. Is GFP a special protein in some way? Why do the rare tRNA-expressing strains sometimes help with protein expression? And most importantly, does this apply broadly or is it specific to E.coli and relatives?
This last point is important in the context of certain projects. E.coli and Saccharomyces have their codon preferences, but if you want to see an extreme preference, look at Streptomyces and its kin. These are important producers of antibiotics and other natural product medications, and it turns out that the codon usage table is easy to remember: just use G or C in the 3rd position. In one species I looked at, it was around 95% of all codons followed that rule.
This has the effect of making the G+C content of the entire ORF quite high, which engenders further problems. High G+C DNA can be difficult to assemble (or amplify) via PCR and it sequences badly. Furthermore, such a limited choice of codons means that anything resembling a repeat at the protein level will create a repeat at the DNA level, and even very short repeats can be problematic for gene synthesis. Long runs of G's can also be problematic for oligonucleotide synthesizers (or so I've been told). From a company's perspective, this is also a problem because customers don't really care about it and don't understand why you price some genes higher than others.
So, would the same strategy work in Streptomyces? If so, one could avoid synthesizing hyper-G+C genes and go with more balanced ones, reducing costs and the time to produce the genes. But, someone would need to make the leap and repeat Kudla et al strategy in some of these target organisms.
The genetic code (at first approximation) uses 64 codons to encode 21 different signals; hence there are some choices as to which codon to use. Amino acids and stop can have 1, 2, 3, 4, or 6 codons in the standard scheme of things. But those codons are rarely used with equal frequency. Leucine, for example, has 6 codons, some rarely used and others often. Which codons are preferred and disfavored, and the degree to which this is true, depends on the organism. In the extreme, a codon can fall so far out of favor that it goes extinct & can no longer be used, and sometimes it is later reassigned to something else; hence some of the tidier codes in certain organisms.
A further observation is that the more favored codons correspond to more abundant tRNAs and less favored ones to less abundant tRNAs. Furthermore, highly expressed genes are often rich in favored codons, while lowly expressed ones are much more likely to use rare ones. To complete the picture, in organisms such as E.coli there are genes which don't seem to follow the usual pattern -- and these are often associated with mobile elements and phage, or show other signs that they may be recent acquisitions from another species.
A practical application of this is to codon optimize genes. If you are having a gene built to express a protein in a foreign host, then it would seem apropos to adjust the codon usage to the local dialect, which usually still leaves plenty of room to accommodate other wishes (such as avoiding the recognition sites for specific restriction enzymes). There are at least four major schemes for doing this, with different gene synthesis vendors preferring one or another:
- CAI Maximization. CAI (the Codon Adaptation Index) is a measure of the usage of preferred codons; this strategy tries to maximize that statistic by using the most preferred codons. Logic: if these are the most preferred codons, and highly expressed genes are rich in them, why not do the same?
- Codon sampling. This strategy (which is what Codon Devices offered) samples from a set of codons with probabilities proportional to their usage in the organism, after first zeroing out the very rare codons and renormalizing the table; a minimal sketch of the idea follows this list. Logic: avoid the rare ones, but don't hammer the favored ones either; balance is always good.
- Dicodon optimization. In addition to individual codons showing preferences, there is also a pattern by which adjacent codons pair slightly non-randomly. One particular example: very rare codons are very unlikely to be followed by another very rare codon. Logic: an even better take on "when in Rome..." than either of the two above.
- Codon frequency matching. Roughly, this means looking at the native mRNA's use of codons and aping it in the target species; a codon which is rare in the native host should be replaced with one that is rare in the target. Logic: some rare codons may just help the protein fold properly.
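To make the codon-sampling idea concrete, here is a minimal Python sketch of how such a back-translator might look. Everything in it is an illustrative assumption -- the usage fractions, the 10% rarity cutoff, and the tiny table covering only three amino acids -- not any vendor's actual table or production code.

```python
import random

# Illustrative codon usage fractions for three amino acids (roughly E.coli-like;
# a real table would cover all twenty amino acids plus stop).
CODON_USAGE = {
    "L": {"CTG": 0.50, "TTA": 0.13, "TTG": 0.13, "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
    "K": {"AAA": 0.76, "AAG": 0.24},
    "E": {"GAA": 0.69, "GAG": 0.31},
}

RARE_CUTOFF = 0.10  # assumed threshold: codons used less often than this are dropped

def sample_codon(aa, rng=random):
    """Zero out rare codons, renormalize, then sample a codon for `aa`
    with probability proportional to its remaining usage."""
    kept = {c: f for c, f in CODON_USAGE[aa].items() if f >= RARE_CUTOFF}
    r = rng.random() * sum(kept.values())
    for codon, freq in kept.items():
        r -= freq
        if r <= 0:
            return codon
    return codon  # guard against floating-point round-off

def back_translate(protein):
    """Back-translate a protein sequence by codon sampling."""
    return "".join(sample_codon(aa) for aa in protein)

print(back_translate("LLKE"))  # e.g. 'CTGTTAAAAGAA'; output varies run to run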
A related strategy worth mentioning is the use of special expression strains which carry extra copies of the rare tRNAs.
There is a lot of literature on codon optimization, and most of it suffers from the same flaw. Most papers describe taking one ORF, re-synthesizing it with a particular optimization scheme, and then comparing the two. One problem with this is the small N and the potential for publication bias (do people publish less frequently when this fails to work?). Furthermore, it could well be that the resynthesized design changed something else, and the codon optimization itself is unimportant. A few papers deviate from this plan, & there have been hints from the structural genomics community about surveying their own data (as they often codon optimized), but systematic studies aren't common.
Now in Science comes the sort of paper that starts to be systematic:
Coding-Sequence Determinants of Gene Expression in Escherichia coli
Grzegorz Kudla, Andrew W. Murray, David Tollervey, and Joshua B. Plotkin
Science 10 April 2009: 255-258.
In short, they generated a library of synonymous GFP variants in which codon usage was varied randomly, and then expressed these from a standard sort of expression vector in E.coli. The summary of their results is that codon usage didn't correlate with GFP brightness (expression); the key factor instead is avoidance of secondary structure near the beginning of the ORF.
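As an aside on how one might act on this result when designing a gene, here is a hedged sketch of screening candidate synonymous designs by predicted folding energy near the start of the ORF. It assumes the ViennaRNA Python bindings are installed; the 42-nt window and the ranking criterion are my own arbitrary choices, not the authors' protocol.

```python
import RNA  # ViennaRNA Python bindings (assumed available)

WINDOW = 42  # nucleotides from the start codon to fold; arbitrary illustrative choice

def five_prime_mfe(cds):
    """Predicted minimum free energy (kcal/mol) of the first WINDOW nt of a CDS.
    Values closer to zero mean less stable structure near the start of the ORF."""
    rna = cds[:WINDOW].upper().replace("T", "U")
    structure, mfe = RNA.fold(rna)
    return mfe

def rank_designs(designs):
    """Order candidate synonymous coding sequences, least structured first."""
    return sorted(designs, key=five_prime_mfe, reverse=True)
```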
It's a good approach, but one question is how general the result is. Is GFP a special protein in some way? Why do the rare-tRNA-expressing strains sometimes help with protein expression? And most importantly, does this apply broadly, or is it specific to E.coli and its relatives?
This last point is important in the context of certain projects. E.coli and Saccharomyces have their codon preferences, but if you want to see an extreme preference, look at Streptomyces and its kin. These are important producers of antibiotics and other natural product medications, and it turns out that their codon usage table is easy to remember: just use G or C in the 3rd position. In one species I looked at, around 95% of all codons followed that rule.
This has the effect of making the G+C content of the entire ORF quite high, which engenders further problems. High G+C DNA can be difficult to assemble (or amplify) via PCR and it sequences badly. Furthermore, such a limited choice of codons means that anything resembling a repeat at the protein level will create a repeat at the DNA level, and even very short repeats can be problematic for gene synthesis. Long runs of G's can also be problematic for oligonucleotide synthesizers (or so I've been told). From a company's perspective, this is also a problem because customers don't really care about it and don't understand why you price some genes higher than others.
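To illustrate the kind of checks involved, here is a short sketch that flags the synthesis headaches just mentioned: high overall G+C, long runs of G, and exact internal repeats. The thresholds are invented for the example, not any company's actual acceptance criteria.

```python
import re
from collections import Counter

def synthesis_red_flags(seq, kmer=12, max_g_run=5, gc_limit=0.70):
    """Flag high G+C content, long runs of G, and repeated k-mers.
    All thresholds are illustrative guesses."""
    seq = seq.upper()
    flags = []
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if gc > gc_limit:
        flags.append(f"G+C content {gc:.0%} exceeds {gc_limit:.0%}")
    longest_g = max((len(m.group()) for m in re.finditer("G+", seq)), default=0)
    if longest_g > max_g_run:
        flags.append(f"run of {longest_g} consecutive G's")
    repeated = [k for k, n in Counter(
        seq[i:i + kmer] for i in range(len(seq) - kmer + 1)).items() if n > 1]
    if repeated:
        flags.append(f"{len(repeated)} distinct {kmer}-mers occur more than once")
    return flags
```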
So, would the same strategy work in Streptomyces? If so, one could avoid synthesizing hyper-G+C genes and go with more balanced ones, reducing both the cost and the time to produce the genes. But someone would need to make the leap and repeat the Kudla et al. strategy in some of these target organisms.
Wednesday, July 30, 2008
Paring pair frequencies pares virus aggressiveness
Okay, I'm a bit late with this as it came out in Science about a month ago, but it's a cool paper & it illustrates a number of issues I've dealt with at my current shop. Also, in the small-world department, one of the co-authors was my eldest brother's roommate at one point.
The genetic code uses 64 codons to code for 21 different symbols -- 20 amino acids plus stop. Early on this was recognized as implying that either (a) some codons are simply not used or (b) many symbols have multiple, synonymous codons, which turns out to be the case (except in a few species, such as Micrococcus luteus, which have lost the ability to translate certain codons).
Early on in the sequencing era (certainly before I jumped in) it was noted that not all synonymous codons are used equally. These patterns of codon bias are specific to particular taxa: the codon usage of Escherichia is different from that of Streptomyces. Furthermore, it was noted that there is a signal in pairs of successive codons; that is, the frequency of a given codon pair is often not simply the product of the two codons' individual frequencies. This was (and is) one of the key signals which gene-finding programs use to hunt for coding regions in novel DNA sequences.
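For the curious, here is a minimal sketch of that codon pair statistic: observed counts of adjacent codon pairs versus what the individual codon frequencies would predict. The published codon pair score also normalizes for amino acid pair frequencies; this toy version skips that refinement.

```python
from collections import Counter

def codons(orf):
    """Split an in-frame ORF into codons (any trailing partial codon is dropped)."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]

def codon_pair_bias(orfs):
    """Observed/expected ratio for each adjacent codon pair across a set of ORFs.
    Ratios well below 1 mark under-represented (avoided) pairs."""
    singles, pairs = Counter(), Counter()
    for orf in orfs:
        cs = codons(orf.upper())
        singles.update(cs)
        pairs.update(zip(cs, cs[1:]))
    n_singles = sum(singles.values())
    n_pairs = sum(pairs.values())
    return {
        (a, b): obs / ((singles[a] / n_singles) * (singles[b] / n_singles) * n_pairs)
        for (a, b), obs in pairs.items()
    }
```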
Codon bias can be mild or it can be severe. Earlier this year I found myself staring at a starkly simple codon usage pattern: C or G in the 3rd position. In many cases the C+G codons for an amino acid had >95% of the usage. For both building & sequencing genes this has a nasty side-effect: the genes are very GC rich, which is not good (higher melting temp, all sorts of secondary structure options, etc).
Another key discovery is that codon usage often correlates with protein abundance; the most abundant proteins hew most closely to the species-specific codon bias pattern. It further turned out that the tRNAs for highly used codons tend to be the most abundant in the cell, suggesting that frequent codons optimize expression. Furthermore, it could be shown that in many cases rare codons interfere with translation. Hence, if you take a gene from organism X and try to express it in E.coli, it will frequently translate poorly unless you recode the rare codons out of it. Alternatively, expressing additional copies of the tRNAs matching the rare codons can also boost expression.
Now, in the highly competitive world of gene synthesis this was (and is) viewed as a selling point: building a gene is better than copying it as it can be optimized for expression. Various algorithms for optimization exist. For example, one company optimizes for dicodons. Many favor the most common codons and use the remainder only to avoid undesired sequences. Locally we use codons with a probability proportional to their usage (after zeroing out the 'rare' codons). Which algorithm is best? Of course, I'm not impartial, but the real truth is there isn't any systematic comparison out there, nor is there likely to be one given the difficulty of doing the experiment well and the lack of excitement in the subject.
Besides the rarity of codons affecting translation levels, how else might synonymous codons not be synonymous? The most obvious is that synonymous codons may sometimes have other signals layered on them -- that 'free' nucleotide may be fixed for some other reason. A more striking example, oft postulated but difficult to prove, is that rare codons (especially clusters of them) may be important for slowing the ribosome down and giving the protein a chance to fold. In one striking example, changing a synonymous codon can change the substrate specificity of a protein.
What came out in Science is codon rewriting, enabled by synthetic biology, on a grand scale. Live virus vaccines are just that: live, but attenuated, versions of the real thing. They have a number of advantages (such as the ability to jump from a vaccinated person to an unvaccinated one), but the catch is that attenuation is due to a small number of mutations. Should these mutations revert, pathogenicity is restored. So if there were a way to make a large number of mutations of small effect in a virus, the probability of reversion would be low, but the sum of all those small changes would be attenuation of the virus. And that's what the Science authors have done.
Taking poliovirus, they recoded the protein coding regions to emphasize codon pairs that are rare in human genes (PV-Min), while preserving certain other known key features, such as secondary structures and overall folding energy. A second mutant was made that emphasized very common codon pairs (PV-Max). In both cases, more than 500 synonymous mutations were made relative to wild-type polio. Two further viruses were built by subcloning pieces of the synthetic viruses into a wild-type background.
Did this really do anything? Well, their PV-Max had similar in vitro characteristics to wild virus, whereas PV-Min was quite docile, failing to make plaques or kill cells. Indeed, it couldn't be cultured in cells.
The part-Min, part-wild-type chimaeras also showed severe defects, and some couldn't be propagated as viruses either. However, one containing two segments of engineered low-frequency codon pairs, called PV-MinXY, could be propagated but was greatly attenuated. While its ability to make virions was only modestly reduced (perhaps to one tenth the number), more strikingly about 100X as many virions were required for a successful infection. Repeated passaging of PV-MinXY and another chimaera failed to alter the infectivity of the viruses; the strategy of stabilizing attenuation through a plethora of small mutations appears to work.
When my company was trying to sell customers on the value of codon optimization, one frustration for me as a scientist was the paucity of really good studies showing how big an effect it could have. Most studies in the field are poorly done, with too few controls and only a protein or two. Clearly there is a signal, but it was always hard to say "yes, it can have huge effects". In this study of codon optimization writ large, codon choice clearly has enormous effects.
Tuesday, July 29, 2008
The youngest DNA author?
Earlier this year an interesting opportunity presented itself at the DNA foundry where I am employed. For an internal project we needed to design 4 stuffers. Stuffers are the stuff of creative opportunity!
A stuffer is a segment of DNA whose only purpose is to take up space. Most commonly, some sort of vector is to be prepared by digesting with two restriction enzymes, and the correct piece is then purified by gel electrophoresis and manual excision from the gel. If you really need a double digestion, then the stuffer is important so that single-digestion products are resolvable from the desired product; the size of the stuffer causes single digests to run at a discernibly different position.
Now, we could have made all 4 stuffers nearly the same, but there wasn't any significant cost advantage, and where's the fun in that? We did need to make sure this particular stuffer contained stop codons guarding its frontiers (to prevent any expression of, or through, the stuffer), possessed the key restriction sites, and lacked a host of other sites possibly used in manipulating the vector. It also needed to be easily synthesizable and verifiable by Sanger sequencing -- no runs of 100 A's, for example. But beyond that, it really didn't matter what went in.
So I whipped together some code to translate short messages written in the amino acid code (while obeying the restriction site constraints) and wrap each message into the scaffold framework, and I started cooking up messages and words to embed. One stuffer contains a fragment of a post of mine from last year which happened to obey the amino acid code (the first blog-in-DNA?); another celebrates the "Dark Lady of DNA". Yet another has the beginning of the Gettysburg Address, with 'illegal' letters simply dropped. Some other candidates were considered and parked for future use: the opening phrase of a great work of literature ("That Sam I am, That Sam I am" -- the title also works!), and a paean to my wagging companion.
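The actual script isn't reproduced here, but a minimal sketch of the idea might look like the following. The codon choices, the pair of forbidden sites (EcoRI and BamHI as stand-ins), and the retry strategy are assumptions for illustration, not what was actually run.

```python
import random

# One-letter amino acid code mapped to a few codon choices each
# (choices here are arbitrary; the real script applied more constraints).
CODONS = {
    "A": ["GCT", "GCA"], "C": ["TGC"], "D": ["GAT"], "E": ["GAA"],
    "F": ["TTC"], "G": ["GGT", "GGC"], "H": ["CAC"], "I": ["ATC"],
    "K": ["AAA"], "L": ["CTG", "TTA"], "M": ["ATG"], "N": ["AAC"],
    "P": ["CCG"], "Q": ["CAG"], "R": ["CGT"], "S": ["AGC", "TCT"],
    "T": ["ACC"], "V": ["GTT"], "W": ["TGG"], "Y": ["TAC"],
}

FORBIDDEN_SITES = ["GAATTC", "GGATCC"]  # EcoRI and BamHI, as illustrative stand-ins

def encode_message(message, tries=1000):
    """Drop 'illegal' letters (those with no amino acid), back-translate the
    rest, and re-sample codons until no forbidden restriction site appears."""
    letters = [c for c in message.upper() if c in CODONS]
    for _ in range(tries):
        dna = "".join(random.choice(CODONS[aa]) for aa in letters)
        if not any(site in dna for site in FORBIDDEN_SITES):
            return dna
    raise RuntimeError("constraints unsatisfiable; add more codon choices")

print(encode_message("That Sam I am, That Sam I am"))
```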
But the real excitement came when I realized I could subcontract the work out. My code did all the hard work, and another layer of code by someone else would apply another round of checks. The stuffer would never leave the lab, so there was no real safety concern. So I offered the challenge to The Next Generation and he accepted.
He quickly adapted to the 'drop illegal letters' strategy and wrote his own short ode to his favorite cartoon character, a certain glum tentacled cashier. I would have let him do more, but creative writing's not really his preferred activity & the novelty wore off. But, his one design was captured and was soon spun into oligonucleotides, which were in turn woven into the final construct.
So, at the tender age of 8 and a few months, the fruit of my chromosomes has inscribed a message in nucleotides. For a moment, I will claim he is the youngest to do so. Of course, making such a claim publicly is a sure recipe for destroying it, as someone will either come forward with a tale of their toddler flipping the valves on their DNA synthesizer or just be inspired to have their offspring design a genome (we didn't have the budget!).
And yes, at some future date we'll sit down and discuss the ethics of the whole affair. How his father made sure that his DNA would be inert (my son, the pseudogene engineer!) and what would need to be considered if this DNA were to be contemplated for environmental release. We might even get into even stickier topics, such as the propriety of wheedling your child to provide free consulting work!
Monday, August 06, 2007
Pre-WWW Hyperlinking
I recently attempted to rhapsodize on the wonders of restriction endonucleases. My exploration of this area has also reacquainted me with an amazing invention, what I might argue is the first artifact of what we now call synthetic biology.
An important early use, still going strong, for restriction enzymes is the cutting-and-pasting of DNA sequences. An early vector which was heavily used was pBR322, and it was also one of the first DNA molecules to have its entire sequence determined. pBR322 was particularly useful because for certain popular restriction enzymes it contained only a single site and that site was not in a critical region. This facilitated cloning into that site.
However, only a few restriction enzymes fit this description. In addition, a common problem with cloning into plasmids was that of empty vector, in which the plasmid reseals without capturing a DNA of interest. A clever scheme emerged somewhere of cloning into a portion (the alpha peptide) of E.coli beta-galactosidase; if the plasmid captured an insert then beta-Gal function would be disrupted. This loss-of-function would show up as white colonies when the E.coli were grown on media containing synthetic compounds that turn blue when cleaved by beta-Gal.
It turns out that this alpha peptide will accept a significant insertion of amino acids, and somewhere the germ of the idea of a polylinker emerged: a sequence containing many unique restriction sites that still enables blue-white cloning. For what I believe is the first time, a human sat down, designed a specific & novel DNA sequence for a specific & novel purpose, and had it synthesized. Previous DNA synthesis efforts, such as Har Gobind Khorana's original effort to make a tRNA gene or the synthesis of an artificial human hormone gene at UCSF, were intended to make something already extant in nature. The first polylinker was perhaps the first creative work written in DNA!
That original polylinker had mirror symmetry and just 4 cloning sites, with the symmetry preventing the use of pairs of sites. Not long afterwards came the pUC polylinkers, which have each site represented only once and a very dense packing of sites. These have been propagated to many other vectors.
I've seen other polylinkers, but none seem to have the popularity of the pUC polylinkers. Shown below is the pUC18 polylinker; one additional twist is that this sequence reads through (no stop codons) in either direction. pUC19 simply has the polylinker in the opposite orientation.
CAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCGT
Two pedagogic angles occur to me. For any biology class, it would be fun to follow up the session on restriction enzymes by handing each student the pUC polylinker sequence; the assignment is to find as many six- or eight-basepair palindromes as possible. The other interesting assignment would be for an advanced bioinformatics class: write a program that takes a set of restriction enzymes and builds a polylinker from them, with shorter outputs scoring higher and bidirectionality scoring higher. Such an exercise would really underline the achievement of the pUC design, which I believe was done with pencil and paper, not by computer program.
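For the first exercise, a few lines of Python suffice to scan the polylinker shown above for windows that equal their own reverse complement -- the palindromes most type II restriction enzymes recognize. This is just a sketch of the classroom exercise, not a full restriction-mapping tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def palindromic_sites(seq, size=6):
    """Every (position, window) where the window equals its own reverse complement."""
    return [(i, seq[i:i + size])
            for i in range(len(seq) - size + 1)
            if seq[i:i + size] == revcomp(seq[i:i + size])]

PUC18_MCS = "CAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCGT"
for pos, site in palindromic_sites(PUC18_MCS):
    print(pos, site)  # AAGCTT (HindIII), GCATGC (SphI), CTGCAG (PstI), ...
```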
Monday, February 05, 2007
Catching some rays
Alas, due to a disagreement between myself & Blogger as to how to timestamp things (and Blogger won), my initial post for the Week of Science has so far failed to show up there, despite my attempts to unstick it. From lemons shall come lemonade, in this case an extra post.
A key goal of synthetic biology is to transfer completely novel traits to organisms; doing so demonstrates an understanding of the trait and perhaps might be useful for something, though the latter is hardly essential.
Last week's PNAS advance publication site has a neat item (alas, requiring subscription for the text) showing how simple it can all be. By adding a single protein to E.coli the biology of E.coli is changed in a radical way.
There is something uniquely pleasurable about sitting in a sunbeam, since the plain glass window takes out the nasty UV. The Omics! Omics! director of ergonomics (responsible for preventing repetitive stress injuries by ensuring regular breaks from the keyboard) certainly likes a good sunbeam. But while the sunbeam provides me warmth and pleasure, it can't feed me.
E.coli is normally the same way; it knows not what to do with sunlight. But by stealing one little gene from another microorganism, Carlos Bustamante's group has changed that. Certain single-celled organisms called archaeans use light to drive a pump protein called bacteriorhodopsin. The pumping action creates a gradient of hydrogen ions, a gradient which can drive useful work. By moving bacteriorhodopsin into E.coli, the bacterium acquires the ability to generate energy from light of the correct wavelength. Indeed, because a respiratory poison was present, the bacteria became dependent on light for any energy production -- and since E.coli has a flagellum which can propel it, this energy production can be observed as light-dependent swimming.
As mentioned before, the paper requires a subscription, but the supplementary material does not. You can watch a movie of a tethered E.coli responding to light, with the false coloring indicating the wavelength being shown (red for stop, green for go, of course!). Furthermore, the individual tethered cells can be treated like little machines and the forces they generate measured; from this, important details of the molecular machinery can be worked out. Showy, yet practical!
[trying to push this through the Week of Science system too]
Wednesday, November 29, 2006
Phage Renaissance
Bacteriophage, or phage, occupy an exalted place in the history of modern biology. Hershey & Chase used phage to nail down DNA (and not protein) as the genetic material. Benzer pushed genetic mapping to the nucleotide level. And much, much more. Phage could be made in huge numbers, to scan for rare events. Great stuff, and even better that so many of the classic papers are freely available online!
Phage have also been great toolkits for molecular biology. First, various enzymes were purified, many still in use today. Later, entire phage machineries were borrowed to move DNA segments around.
Two of the best studied phage are T7 and lambda. Both have a lot of great history, and both have recently undergone very interesting makeovers.
T7 is a lytic phage; after infection it simply starts multiplying and soon lyses (breaks open) its host. T7 provided an interesting early computational conundrum, one which I believe is still unsolved. Tom Schneider has an elegant theory about information and molecular biology, which can be summarized as: locational codes contain only as much information as they need to be located uniquely in a genome, no more, no less. Testing on a number of promoters suggested the theory was valid. However, a sore thumb stuck out: T7 promoters contain far more information than the theory calls for, and a clever early artificial-evolution approach showed that this extra information really isn't needed by T7 RNA polymerase. So why is there more conservation than 'necessary'? It's still a mystery.
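For readers who want to play with Schneider's measure themselves, here is a minimal sketch of the per-position information content of a set of aligned sites. The example 'sites' are made up purely to exercise the function, and the small-sample correction from the original papers is omitted.

```python
import math
from collections import Counter

def information_content(sites):
    """Schneider-style information content (bits) of aligned DNA sites:
    2 bits per position minus the observed positional entropy, summed over
    positions. The small-sample correction is omitted for brevity."""
    total_bits = 0.0
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total_bits += 2.0 - entropy
    return total_bits

# Made-up aligned sites purely to exercise the function.
example = ["TAATACGACT", "TAATACGACT", "TAATGCGACT", "TAATACGTCT"]
print(f"{information_content(example):.2f} bits")
```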
Phage lambda follows a very different lifestyle. After infection, most times it goes under deep cover, embedding itself at a single location in its E.coli host's genome, a state called lysogeny. But when the going gets tight, the phage get going and go through a lytic phase much like that of T7. The molecular circuitry responsible for this bistable system was one of the first complex genetic systems elucidated in detail. Mark Ptashne's book on this, A Genetic Switch, should be part of the Western canon -- if you haven't read it, go do so! (Amazon link)
With classical molecular biology techniques, modest tinkering or wholesale vandalism were the only really practical ways to play with a phage genome: you could rewrite a little or delete a lot. Despite that, it is possible to do a lot with these approaches. In today's PNAS preprint section (alas, you'll need a subscription to get beyond the abstract) is a paper which re-engineers the classic lambda switch machinery. The two key repressors, CI and Cro, are replaced with two other well-studied repressors whose activity can be controlled chemically, TetR and LacI, and appropriate operator sites for these repressors were installed in the correct places. In theory, the new circuit should perform the same lysis-lysogeny switch as lambda phage 1.0, except now under the control of tetracycline (TetR, replacing CI) and lactose (LacI, replacing Cro). Of course, things don't always turn out as planned.
These variants grew lytically and formed stable lysogens. Lysogens underwent prophage induction upon addition of a ligand that weakens binding by the Tet repressor. Strikingly, however, addition of a ligand that weakens binding by Lac repressor also induced lysogens. This finding indicates that Lac repressor was present in the lysogens and was necessary for stable lysogeny. Therefore, these isolates had an altered wiring diagram from that of lambda. When theory fails to predict, new science lies ahead!
Even better, with the advent of cheap synthesis of short DNA fragments ("oligos") and new methods of putting them together, the possibility of printing "all the phage that's fit to print" is really here. This new field of "synthetic biology" offers all sorts of new experimental options, and of course a new set of potential misuses. Disclosure: my next posting might be with one such company.
Such rewrites are starting to show up. Last year one team reported rewriting T7. Why rewrite? A key challenge in trying to dissect the functions of viral genes is that many viral genes overlap. Such genetic compression is common in small genomes, and gets more impressive the smaller the genome. But, if tinkering with one gene also tweaks one or more of its neighbors, interpreting the results becomes very hard. So by rewriting the whole genome to eliminate overlaps, cleaner functional analysis should be possible.
With genome editing becoming a reality, perhaps it's time to start writing a genetic version of Strunk & White :-)