Monday, May 20, 2019

Mass Recoded E.coli Genome Not Tripped Up By Programmed Frameshifts

There's a paper this week in Nature announcing an E.coli genome which has had two serine codons (UCA, UCG) and one stop codon (UAG) removed from usage.  It's a major work in synthetic biology and represents the largest designed sequence ever built.  In contrast to Craig Venter's early effort, which moved a synthesized genome into a cellular ghost of a natural bacterium, this one replaced the native E.coli genome in stages -- Escherichia theseusshipii would be a good name for the new strain. But is the genome quite what is advertised? Following up on a pair of posts from Sandeep Chakraborty showing remaining UCA, UCG and UAG codons in a bunch of typical genes, I decided to look for a trickier set of possibilities to overlook -- and by luck or care the Nature paper got these.  Just to put one gripe front-and-center, the group deposited in Genbank the reduced genome version of E.coli they started with, but not the recoded genome, which is in the supplementary material.
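For anyone who wants to run this sort of check themselves, here is a minimal sketch of counting the targeted codons in a given reading frame of a coding sequence. The function name and the toy sequence are mine, purely for illustration; nothing here is taken from the paper or from the earlier posts.

```python
from collections import Counter

# Codons targeted for removal, written in the DNA alphabet (U -> T).
TARGET_CODONS = {"TCA", "TCG", "TAG"}

def count_target_codons(cds, frame=0):
    """Count occurrences of the targeted codons in one reading frame of cds."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    # Step through complete codons only, starting at the requested frame offset.
    for i in range(frame, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in TARGET_CODONS:
            counts[codon] += 1
    return counts

# Hypothetical toy sequence, not a real gene: ATG TCA GGC TCG AAA TAG
toy_cds = "ATGTCAGGCTCGAAATAG"
print(count_target_codons(toy_cds))           # frame 0
print(count_target_codons(toy_cds, frame=1))  # an alternate frame
```

The frame argument matters for the genes discussed here, since a frameshifted product is read partly in a frame that an ordinary annotation-driven check would never look at.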

I embrace the weirdness that is biology.  I love the exceptions and enjoy learning new examples.  I cringe at nearly every use of "always" or "never" in explaining biology -- it's biology and there is nearly never a rule without an exception (sure, I'm allowed to violate this).  I never refer to the most common genetic code as "universal" -- it's canonical.  Our own mitochondria use a non-standard code with four differences so that there are four stop codons, two tryptophan codons and two methionine codons.  There are even codes which don't have dedicated stop codons.  And there is of course context-specific insertion of selenocysteine and pyrrolysine.
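The mitochondrial arithmetic is easy to check with a small sketch. I'm assuming the vertebrate mitochondrial code here, with its four reassignments expressed as overrides on the canonical table; the variable and function names are my own.

```python
from itertools import product

BASES = "TCAG"
# Canonical code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
standard = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}

# The four differences in the vertebrate mitochondrial code (assumed here).
MITO_OVERRIDES = {"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}
mito = {**standard, **MITO_OVERRIDES}

def codons_for(table, aa):
    """Return the sorted list of codons that a table assigns to aa ('*' = stop)."""
    return sorted(codon for codon, a in table.items() if a == aa)

for label, table in (("canonical", standard), ("mitochondrial", mito)):
    print(label, "stop:", codons_for(table, "*"),
          "Trp:", codons_for(table, "W"),
          "Met:", codons_for(table, "M"))
```

UGA moving from stop to tryptophan while AGA and AGG become stops is what takes the stop count from three to four.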

A perhaps greater deviation from the genetic code and translation as commonly taught is programmed ribosomal frameshifting.  At certain "slippery sites", the ribosome does not always step the usual 3 bases but can slip backwards one (or, in other cases, forwards one).  This feature is used in a wide range of genetic systems.  E.coli has three genes reported to have programmed frameshifting.

DnaX encodes two different DNA polymerase subunits via ribosomal frameshifting.  The canonical reading frame encodes tau with 643 amino acids, but a programmed frameshift essentially truncates the protein to gamma with 431 amino acids; only the final amino acid in gamma differs from the cognate position in tau.  Interestingly, while the frameshifting site is conserved in a number of bacteria (e.g. Thermus aquaticus), according to the UniProt entry, mutations which eliminate the frameshift (and hence gamma) are viable whereas expressing gamma but not tau is inviable.  This region was not recoded, so the frameshift should still happen, and the alternate frame ends at a UGA stop codon, which was not one of the recoding targets.
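To make the "frameshift as truncation" idea concrete, here is a toy sketch of translation with an optional slip. The sequence is invented for illustration and is not the real dnaX slippery site; the function and its slip_after and shift parameters are likewise my own assumptions about how one might model it.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(seq, slip_after=None, shift=0):
    """Translate seq from position 0 until the first stop codon.

    If slip_after is given, the ribosome moves by `shift` bases (-1 or +1)
    right after reading the codon with that index, then carries on in the
    new frame.
    """
    protein = []
    pos = 0
    codon_index = 0
    while pos + 3 <= len(seq):
        aa = CODON_TABLE[seq[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
        pos += 3
        if slip_after is not None and codon_index == slip_after:
            pos += shift
        codon_index += 1
    return "".join(protein)

# Hypothetical toy message: ATG AAA AAG GCT GAA CCC TAA
toy = "ATGAAAAAGGCTGAACCCTAA"
print(translate(toy))                          # no slip: MKKAEP
print(translate(toy, slip_after=2, shift=-1))  # -1 slip after the third codon: MKKG
```

In this toy case the slipped product shares its first three residues with the full-length one, picks up a single different amino acid, and then hits a UGA in the new frame, which is the same general shape as the tau/gamma relationship.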

CopA is a copper transporter; the same gene has been shown to use ribosomal frameshifting to encode a copper chaperone called CopA(Z) (a nomenclature style, by the way, I despise since it requires using punctuation marks -- how are you supposed to speak that?).  The CopA gene has all the appropriate recodes to eliminate the serine codons which have been targeted.  The CopA(Z) chaperone is again mostly a truncation, substituting a single amino acid at the shift location -- a glycine, which is now the last amino acid -- and uses the still valid UAA stop codon.

But perhaps most interesting is the third known example: prfB encoding Release Factor 2.  E.coli has three release factors involved in translational termination at stop codons.  The "nonsense" nomenclature for stop codons is a very unfortunate one: far from lacking sense, they have a very definite sense -- to terminate. Indeed, failure to be able to translate a codon results in ribosome stalling, not termination; termination requires the peptide hydrolysis activity of a release factor.  Release Factor 1 acts at UAG and UAA and Release Factor 2 at UGA and UAA, while Release Factor 3 is not codon-specific but is a GTPase that helps recycle the other two.  Note that only RF1 can execute termination at the UAG stop codon targeted for removal.  However, in their Genbank file prfB does not have a coding region and is labeled "mutated gene, peptide chain release factor RF-2".
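The logic of why UAG is the stop codon to pick can be sketched in a few lines, using only the RF1 and RF2 specificities given above. The dictionary and function names are mine.

```python
# Stop codons recognized by each codon-specific release factor.
RELEASE_FACTOR_CODONS = {
    "RF1": {"UAG", "UAA"},
    "RF2": {"UGA", "UAA"},
}

def factors_for(stop_codon):
    """Return the release factors able to terminate at stop_codon."""
    return sorted(rf for rf, codons in RELEASE_FACTOR_CODONS.items()
                  if stop_codon in codons)

def orphaned_stops(removed_factor):
    """Return stop codons left with no release factor if one factor is deleted."""
    all_stops = set().union(*RELEASE_FACTOR_CODONS.values())
    remaining = set().union(*(codons for rf, codons in RELEASE_FACTOR_CODONS.items()
                              if rf != removed_factor))
    return sorted(all_stops - remaining)

print(factors_for("UAA"))     # both factors cover UAA
print(orphaned_stops("RF1"))  # only UAG depends solely on RF1
```

So once UAG is gone from the genome, RF1 has no unique job left, whereas RF2 is still needed for every UGA.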

But this is incorrect; the gene is not mutated but requires a programmed frameshift for correct translation.  The fun part here is that the canonical reading frame has an early UGA stop codon, which is read by RF2 itself; when RF2 is scarce, the ribosome stalls at that UGA, which encourages a +1 ribosomal frameshift into the alternate reading frame that encodes the rest of the protein.  There are notes in the Genbank file that "manual edits" occurred in the RF2 reading frame -- so perhaps the annotation wasn't in sync with the editing effort.

This recoding paper has re-stirred some thoughts I've had for (in my opinion) fascinating but probably purely academic genome rewrites.  One of these I thought I wrote up, but neither browsing the last few years here nor Google finds it -- potentially a victim of over-thinking the writing of something.  Maybe when I get back from London Calling I'll try to put some effort into that -- but LC will be the focus of the rest of my week.
