Thursday, November 19, 2020

Bioinformatics Exercise: Gene-Specific Ser Codon Usage

I saw a provocative abstract in PNAS about the usage of serine codons in E. coli that triggered the "this could make an interesting student exercise" reflex (for my prior effort, see: Exercise: A Sequence Signature for Transcription-Translation Coupling in Bacteria?).  The paper is behind a paywall (though a relatively cheap one at $10), so I haven't actually read it, but for the purposes here that isn't really a problem -- I'm not going to critique the paper, just use the concept as a springboard.
Serine is the only one of the standard twenty amino acids whose codons fall into two groups that are more than one mutation apart: there are four TCN codons and two AGY codons.  The paper looks at the usage of these codons and finds that some genes use exclusively one type or the other.  That's an interesting observation, though I'm a bit skeptical of the authors tying it to the evolution of the primordial genetic code, particularly since the abstract promises the analysis only in E. coli.

For the student, the first task is to set up a framework for taking a given annotated genome in GenBank format and converting it to a table of genes with the appropriate statistics for the analysis -- total number of codons, number of TCN Ser, number of AGY Ser, etc.  Of course, there is great latitude here -- if you aren't scared by tables with many columns, first compute the frequency of each codon (which would be very useful in its own right), then add columns that aggregate by amino acid and by TCN vs. AGY.  Ideally some other columns as well.  Bonus points for getting this to deal with small eukaryotes and their introns.
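
As a starting point, here is a minimal sketch of the per-gene counting step in Python.  It assumes the coding sequence has already been pulled out of the GenBank record as an in-frame DNA string (any GenBank parser would do for that part); the function names and column names are my own invention, not anything from the paper.

```python
from collections import Counter

# The two serine codon families: four TCN codons and two AGY codons.
TCN = {"TCT", "TCC", "TCA", "TCG"}
AGY = {"AGT", "AGC"}

def codon_counts(cds):
    """Count codons in an in-frame coding sequence given as a DNA/RNA string."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))

def ser_stats(gene_id, cds):
    """Build one table row for a gene (column names are hypothetical)."""
    counts = codon_counts(cds)
    return {
        "gene": gene_id,
        "codons": sum(counts.values()),
        "ser_tcn": sum(counts[c] for c in TCN),
        "ser_agy": sum(counts[c] for c in AGY),
    }

# Toy sequence, not a real gene: ATG TCT AGC TCA TAA
row = ser_stats("toy", "ATGTCTAGCTCATAA")
print(row)
```

Keeping the full `Counter` around makes it easy to add the per-codon frequency columns later; the stop codon is counted in `codons` here, which is a choice each student would need to make explicitly.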

This makes the exercise a particularly interesting possibility for a course: each student could be assigned a different genome or set of genomes.  A first possible interesting result would be just to ascertain the number of genes in each genome which are TCN-only, AGY-only or mixed to a degree.
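
That tally could be sketched along these lines, assuming each gene has already been reduced to a row holding its TCN and AGY serine counts (the `ser_tcn` and `ser_agy` keys and the category labels are hypothetical names of mine).

```python
from collections import Counter

def classify_ser_usage(n_tcn, n_agy):
    """Put a gene into one of four bins based on its serine codon counts."""
    if n_tcn == 0 and n_agy == 0:
        return "no_ser"
    if n_agy == 0:
        return "TCN_only"
    if n_tcn == 0:
        return "AGY_only"
    return "mixed"

def agy_fraction(n_tcn, n_agy):
    """Fraction of serine codons that are AGY; None if the gene has no serine."""
    total = n_tcn + n_agy
    return n_agy / total if total else None

def summarize(rows):
    """Count genes per bin for one genome."""
    return Counter(classify_ser_usage(r["ser_tcn"], r["ser_agy"]) for r in rows)

# Toy rows, not real data.
rows = [
    {"gene": "a", "ser_tcn": 3, "ser_agy": 0},
    {"gene": "b", "ser_tcn": 0, "ser_agy": 2},
    {"gene": "c", "ser_tcn": 1, "ser_agy": 1},
]
print(summarize(rows))
```

The `agy_fraction` helper is there for the "mixed to a degree" part: a gene that is 95% TCN is probably more interesting than one that is split evenly.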

There's one hypothesis I would have thrown at the original dataset before claiming an ancient effect.  Synonymous codon choice is not uniform, either between organisms or between genes within a genome, and E. coli genes have been shown in the past to be divisible into three distinct classes of codon usage: highly expressed genes, lowly expressed genes and a third class that is rich in horizontally transferred genes.  So do the patterns of AGY/TCN track with these?
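
One rough way to start on that question is to pool serine codons within each class and compare the AGY fractions.  The sketch below assumes every gene row already carries a class label from some upstream analysis; the `usage_class` key and the label strings are placeholders of mine, and assigning the classes is a separate exercise.

```python
from collections import defaultdict

def agy_fraction_by_class(rows):
    """Pool TCN and AGY serine counts per class; return the AGY fraction for each."""
    totals = defaultdict(lambda: [0, 0])  # class label -> [tcn, agy]
    for r in rows:
        totals[r["usage_class"]][0] += r["ser_tcn"]
        totals[r["usage_class"]][1] += r["ser_agy"]
    # Classes with no serine codons at all are left out.
    return {
        cls: agy / (tcn + agy)
        for cls, (tcn, agy) in totals.items()
        if tcn + agy > 0
    }

# Toy rows with made-up numbers, purely to show the shape of the input.
rows = [
    {"gene": "a", "usage_class": "high", "ser_tcn": 6, "ser_agy": 2},
    {"gene": "b", "usage_class": "high", "ser_tcn": 2, "ser_agy": 0},
    {"gene": "c", "usage_class": "low", "ser_tcn": 1, "ser_agy": 1},
]
print(agy_fraction_by_class(rows))
```

Pooling weights long, serine-rich genes more heavily; averaging the per-gene fractions instead is an equally defensible choice and would make a nice comparison.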

Another key question is whether the results are stronger or weaker if we divide serines by how conserved they are.  With a database like Pfam, it should be possible to assign each serine in each protein (or at least most serines in most proteins) a conservation score.  How many uber-conserved serines are TCN vs. AGY?

Yet another interesting look -- and related to that one -- would be to use various tools to build orthology relationships between genes in different organisms.  If a gene is all AGY in E. coli but not all AGY in Bacillus, that could be interesting.

There are certainly other angles one could take.  For example, there are many phage in GenBank, so just computing this for phage genes could be interesting -- since we know they are mobile.  Or we could compare frequencies in highly conserved bacterial proteins (say, those with homologs in human) vs. less conserved ones.

Another fun direction would be to take some genes showing extremes of TCN/AGY codon bias, generate a phylogenetic tree based on each such gene, and then color the nodes based on the TCN/AGY bias.

I hope a number of aspects of this sketch are appealing to students and educators.  First, it exercises a number of different skills -- dealing with common file formats, dealing with orthology, dealing with trees.  There are also all sorts of opportunities to think about how to structure tables, which could easily segue into how to design a relational database.  Since there are so many complete genomes, it would be straightforward to split life between different students or teams.  An interesting variant on that would be to not lay down any rules in advance, then spend a class trying to reconcile the different table organizations that each team generated.  Now that's almost too real: so much time in bioinformatics is spent morphing data between formats and schemas.  And there are lots of opportunities to visualize data and think about different ways of plotting the results.
