Saturday, October 27, 2007

Genomics Lemonade

One of the many attractions of the next generation sequencing techniques is that they eliminate the step of cloning in E.coli the DNA to be sequenced. Not only does this step add complexity and expense, but it also detracts from results. Shotgun sequencing attempts to reconstruct the genome from a large random sample of fragments, but there are some pieces of DNA which clone poorly or not at all in E.coli, skewing the sample. These regions have often required labor intensive, expensive targeted efforts to finish.

However, when life gives out lemons, some break out the sugar and glasses. A new paper in Science Express (subscription required) turns this phenomenon around in a clever way. All those failed clonings weren't nuisances, but experiments -- into what can be cloned into E.coli. And since horizontal transfer of genes is rampant in bacteria, it's an important phenomenon with relevance to medicine (virulence genes are often transferred). And on a huge scale: 246K genes from 79 species, using 1.8 million clones covering 8.9 billion nucleotides.

The first filter was to identify short genes which rarely showed up in toto in plasmid clones, looking at short (<1.5Kb) genes since longer ones will rarely be complete in a short insert clone. Now, common plasmid vectors replicate at multiple copies per cell. To further refine the list, the authors also looked for evidence that these genes were underrepresented in long-insert clones, which typically are in vectors which replicate at a few copies per cell.
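To make that first filter concrete, here is a minimal sketch of how it might look in code. It is not the paper's actual method: the coordinate representation, the `Interval` class, the function names, and the "fewer than a quarter of the median" cutoff are all assumptions made for illustration. Only the 1.5Kb length limit comes from the description above.

```python
from dataclasses import dataclass
from statistics import median

MAX_GENE_LEN = 1500  # "short" gene cutoff (<1.5Kb), as described above


@dataclass(frozen=True)
class Interval:
    """A named span on one genome (hypothetical representation)."""
    name: str
    start: int  # 0-based, inclusive
    end: int    # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def full_coverage_counts(genes, clones):
    """For each gene, count the clones whose insert spans the entire gene."""
    return {
        g.name: sum(1 for c in clones if c.start <= g.start and g.end <= c.end)
        for g in genes
    }


def underrepresented_short_genes(genes, clones, fraction=0.25):
    """Return names of short genes that are rarely cloned intact.

    A gene is flagged when its count of fully spanning clones falls below
    `fraction` times the median count over all short genes. This cutoff is
    an assumed stand-in for whatever statistics the authors really used.
    """
    short = [g for g in genes if g.length < MAX_GENE_LEN]
    counts = full_coverage_counts(short, clones)
    if not counts:
        return []
    cutoff = fraction * median(counts.values())
    return sorted(name for name, n in counts.items() if n < cutoff)
```

Running the same function once on plasmid clones and once on long-insert clones, then keeping only the genes flagged in both, would mirror the two-step refinement described above.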

No one gene was poison from every species, but the 'same' gene from closely related species was often trouble. Species related to E.coli often had more toxic genes, perhaps because these species already had promoters which could drive significant expression in E.coli. So, they took examples, from 31 species, of two such genes (both for ribosomal proteins), put them under the control of an inducible promoter, and showed much greater toxicity when the promoter was turned on. 15 randomly chosen control genes did not show toxicity.

What kind of genes transfer poorly? One major class is proteins involved in the ribosome, a class previously noted to be rarely found amongst genes thought to have been horizontally transferred. One possible inference from this is that the ribosome is a highly tuned machine, with excess components able to fit in but not fully function. Interestingly, the proteins in direct contact with ribosomal RNA were found to be more likely to be in the toxic set.

Another test was to simply look at what E.coli genes can't be transferred into E.coli -- well, transferred from single copy in a wild-like strain to multi-copy in a lab strain. Such genes are probably toxic purely due to dosage effects (such screens have been used to great effect in the past, e.g. this).

What's missing from the paper? Two quick questions came to my mind. First, how many of the genes are essential in E.coli? Second, what if you simultaneously knocked out the endogenous copy and expressed the foreign one -- would that lessen the toxicity?

There are other examples of leveraging trouble into something interesting that I have had some connection to.

During the early 1990's, no sequencing was going fast enough for young, impatient folk, especially E.coli. At one Hilton Head Conference, there was loose talk of a 'schmutz' genome project -- we would go through all the unalignable reads from all the genome sequencing centers, figuring that a significant fraction were E.coli contamination & therefore might help fill in the E.coli genome. Alas, we never actually pushed forward.

When I was at Millennium in the late 1990's, we were mining a lot of EST data from our own libraries, from the public collections, and from the in-licensed Incyte databases. A constant minor nuisance was the presence of different contaminants in these collections, and at one point I had my group trying to clean this up. We could successfully identify a number of contaminants, which were sometimes very center-specific. For example, the Brazilian EST collections had contamination from the citrus (lemon?) pathogenic bacteria they were sequencing at the same time. I regarded this solely as a cleanup operation, and when we were done we were done -- but of course some people think more cleverly & so I was chagrined to see a paper by George Church and company using this technique to associate bacteria and viruses with human disease.

All that writing has made me thirsty. Lemonade anyone?
