The canonical genetic code enables the incorporation of 20 amino acids into proteins. I joked at one point that I'd teach my child only 20 letters, as who needs the rest? Once I even attempted to write a piece for here with only those twenty letters (I think I slipped up and let a single forbidden letter in). Why those 20, and not the numerous other amino acids available in nature, has been the subject of a great deal of speculation.
But what about dropping one? Cutting down the number of codons in use has seen progress - a 57-codon E. coli genome was recently published. Eliminating every use of certain codons frees them up to be reassigned to new amino acids. That is potentially very useful, enabling genetic encoding of new peptide and protein functionalities. But can we just extirpate an amino acid from the proteome? That has no obvious utility, but any such mass perturbation is surely going to uncover something interesting.
Even trimming out codons proves challenging. Sometimes the choice of codon can affect protein folding, sometimes there are overlapping reading frames (perhaps on opposite strands) and other times there are other signals overlaid with this coding region. Perhaps the codon you change is important for forming - or avoiding - some secondary structure. Perhaps it is within the binding site of a regulatory protein or RNA. And so on.
We could edit an amino acid out of the genome directly. Another approach would be to create genetic circuits so that for certain codons we can control the ratio of which amino acid is incorporated: the "proper" one or our substitution. Changing the codon brings in the same complications as genetic code reduction attempts; having translation choice be dialable avoids that but means we can't pick which amino acid is substituted where.
As for which amino acid to drop, I will nominate two: tryptophan and isoleucine.
Tryptophan is the rarest amino acid and has been hypothesized to be the last of the 20 to enter translation during the evolution of the genetic code. So removing it means the fewest changes to the genome and proteome. The catch is that tryptophan is quite distinctive. I don't know the chemistries that tryptophan engages in that would cause it to be conserved, but looking at the BLOSUM62 substitution matrix, tryptophan has the highest value for being matched with itself, 11, and the best substitution (tyrosine) scores only a 2.
So tryptophan elimination won't be easy. It does suggest an interesting test of gene essentiality for tryptophan-containing proteins: how much does replacing the tryptophan impair fitness under some standard growth condition? For a greatly gene-reduced species such as the JCVI version of Mycoplasma, probably nearly every tryptophan is essential (and there are only a few hundred, IIRC; I once computed this). But in a non-reduced, typical E. coli K-12, perhaps many genes aren't important and their tryptophans can be safely erased.
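Counting the tryptophans in a proteome is a small enough job to sketch here. The following is a minimal illustration, not a tested pipeline: it assumes the proteome is available as protein FASTA text, and the record names (`protA`, `protB`) and toy sequences are made up purely for the example.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else "unnamed"), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tryptophan_census(handle):
    """Map each protein name to the number of tryptophan (W) residues it has."""
    return {name: seq.upper().count("W") for name, seq in read_fasta(handle)}


# Toy input; the names and sequences are hypothetical.
toy = StringIO(">protA hypothetical\nMKWLLVWA\nGGW\n>protB\nMKLLVA\n")
census = tryptophan_census(toy)
total_w = sum(census.values())
w_free = sorted(name for name, n in census.items() if n == 0)
print(census, total_w, w_free)
```

Pointing `tryptophan_census` at an open file of a real proteome would give the total number of positions to replace, plus the list of proteins that are already tryptophan-free.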
For tryptophans that are essential, a deep dive would be needed to determine why. Then it's the interesting question of whether you could train an AI to figure out how to replace the required properties of the tryptophan - or perhaps have a model identify alternative pathways that are not tryptophan-dependent. Hmm, just having a version of BioCyc marked up by whether the enzymes contain an essential tryptophan could be an interesting resource to access.
My other candidate is isoleucine, as it would seem having both leucine and isoleucine in a genome is redundant. But looking at BLOSUM62 again, it turns out the closest double to isoleucine (and indeed, the closest scoring pair of all) is valine, not leucine - either amino acid scores 4 against itself and that is only reduced to 3 for isoleucine vs valine.
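The BLOSUM62 comparisons for both candidates are easy to reproduce. The sketch below hard-codes just the four relevant rows, transcribed from the standard BLOSUM62 table; treat the transcription as an assumption and check it against your own copy of the matrix before relying on it. `best_substitute` is a helper name invented for this example.

```python
# Column order used by the standard BLOSUM62 table.
ORDER = "ARNDCQEGHILKMFPSTWYV"

# Four rows transcribed by hand from BLOSUM62 (assumption: verify before use).
ROWS = {
    "W": "-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3",
    "I": "-1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3",
    "L": "-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1",
    "V": "0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4",
}


def best_substitute(residue):
    """Return (self score, best other residue, its score) for one row."""
    scores = dict(zip(ORDER, map(int, ROWS[residue].split())))
    self_score = scores.pop(residue)
    # Ties go to whichever residue comes first in ORDER.
    alt = max(scores, key=scores.get)
    return self_score, alt, scores[alt]


for aa in "WILV":
    self_score, alt, alt_score = best_substitute(aa)
    print(f"{aa}: self={self_score} best={alt} ({alt_score}) "
          f"gap={self_score - alt_score}")
```

The gap between the self score and the best substitute is the number to watch: 9 for tryptophan, but only 1 for isoleucine against valine. With these rows, isoleucine against leucine scores a 2.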
So how many isoleucines can you replace with valine in a proteome and retain full fitness - and how often, when there is a fitness reduction, can you tweak some other positions to something other than isoleucine and restore function? Again, possibly an interesting special case for AI - can you train a model to detect problematic isoleucine->valine changes and to go further and propose the restorative change?
It is interesting that a recent preprint claims that using only a 10 amino acid set - no basics or aliphatics (and some others) - it is possible to build stable proteins. Stable, though not necessarily having all the functions of modern proteins. Still, it is suggestive that what I propose is not nearly as radical as what is possible.
Back on the isoleucine-leucine-valine trio, it would be an interesting undergraduate-level exercise to troll large collections of multiple alignments and search for the positions which show the greatest skew either towards one amino acid (e.g. is there any position that is always leucine and never valine or isoleucine?) or away from an amino acid (e.g. is there any position which is heavily leucine or valine but never isoleucine?).
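As a starting point for that exercise, here is a minimal sketch of the column scan. It assumes the alignment is already loaded as a list of equal-length strings with `-` for gaps; the function name `ilv_skew`, the 0.9 threshold, and the toy alignment are all invented for illustration.

```python
from collections import Counter


def ilv_skew(alignment, min_fraction=0.9):
    """Find columns dominated by I/L/V where at least one of the three is absent.

    Returns (column index, residues present, residues absent) tuples.
    """
    hits = []
    for pos, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        ilv = sum(counts[a] for a in "ILV")
        if total == 0 or ilv / total < min_fraction:
            continue  # column is all gaps or not mostly aliphatic
        present = "".join(a for a in "ILV" if counts[a])
        absent = "".join(a for a in "ILV" if not counts[a])
        if absent:
            hits.append((pos, present, absent))
    return hits


# Toy alignment, hypothetical sequences.
toy_alignment = ["MLIVK", "MLVIK", "MLIVR", "MLVLK"]
hits = ilv_skew(toy_alignment)
print(hits)
```

On the toy input, column 1 is always leucine and column 2 mixes isoleucine and valine with no leucine. With only four sequences that means nothing, of course; on real alignments you would want a minimum depth per column and some correction for closely related sequences before reading anything into an absence.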
If you explore any of these, I hope you will post a link to your code / preprint / notes in the comments - they're always open for that!
1 comment:
Interesting thought experiment, Keith.
Using a restricted alphabet to observe a reduced set of emergent properties is a neat way to tackle mapping functional domain redundancies. On a smaller scale (perhaps within a single protein but not knocking out an amino acid globally) I could see that mutagenesis approach being a good complement to deep mutational scanning and saturation genome editing. At the very least, it could negate some of the circularity problems we run into when benchmarking variant effect predictors on functional data.
Best,
Caitlyn Chitwood