A central problem with the parts-list point of view is that we can't make much sense of the list. Our "parts" are amino acid sequences, and we lack the ability to routinely fold them into correct three-dimensional structures. Even with the vast progress on that front, given a three-dimensional structure we aren't yet adept at identifying the function of a protein. We also must not fall into the trap of thinking one protein = one function. In addition to any enzyme having a certain spectrum of allowable substrates, many proteins perform multiple roles in a cell that may not be obviously related to each other ("moonlighting").
So we work mostly by analogy. Take all your proteins and compare them, probably with BLAST, to all other proteins. A wise analyst will also use libraries of protein family models such as PFAM. There will be a number of typical patterns to the results.
The happiest category is close matches to proteins of known function, particularly those which are typically found in one or a few copies. For these, one can have the highest confidence in a functional assignment (though never 100%; remember the moonlighting!). Depending on the family and the level of identity, you might still hedge your bets on the precise cofactor (NADH or NADPH?).
Another category will be proteins which have similarity to proteins with known function, but the similarity is edging towards the twilight zone. Perhaps even past it; maybe only with PFAM+HMMER or PSI-BLAST is the similarity picked up. So you might assign a general category of enzyme ("dehydrogenase"), but be very cautious in guessing the precise substrate. Maybe you can lean towards a class of substrate ("looks like a sugar dehydrogenase") but maybe not.
The next, far more frustrating category is the conserved proteins of unknown function, or proteins with Domains of Unknown Function (DUFs). Clearly these are important, as evolution has held onto them over vast distances, but we are clueless as to their function. Occasionally we have a distant clue from some other analytical technique, such as predicted transmembrane domains.
Then the most enigmatic, the clade-specific predicted proteins of unknown function. Perhaps these are gene prediction excess or perhaps they truly are specialized proteins needed only by certain types of organisms.
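To make those four bins concrete, here is a minimal sketch of how such a triage might be coded. Everything in it is my own illustration: the `BestHit` record, the `triage` function and especially the 40% identity cutoff are hypothetical, not values from any tool or paper, and a real pipeline would weigh alignment coverage, E-values and family-specific behavior as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    percent_identity: float      # identity to the best database match
    known_function: bool         # is the matched protein characterized?
    profile_only: bool = False   # found only via profile methods (HMMER, PSI-BLAST)


def triage(hit: Optional[BestHit], conserved_widely: bool,
           close_cutoff: float = 40.0) -> str:
    """Sort one protein into the four rough categories described above.

    ASSUMPTION: close_cutoff is an arbitrary illustrative threshold.
    """
    if hit is not None and hit.known_function:
        if hit.percent_identity >= close_cutoff and not hit.profile_only:
            return "close match: assign function, hedge on cofactors"
        return "twilight zone: assign a general enzyme class only"
    # No hit, or the hit is itself uncharacterized.
    if conserved_widely:
        return "conserved, function unknown"
    return "clade-specific, function unknown"


print(triage(BestHit(72.0, True), conserved_widely=True))
print(triage(BestHit(24.0, True, profile_only=True), conserved_widely=True))
print(triage(BestHit(55.0, False), conserved_widely=True))
print(triage(None, conserved_widely=False))
```

The point of the sketch is only that the first branch is the rare happy case; everything else falls through to progressively less informative labels.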
We've had a complete sequence for Escherichia coli for twenty years, yet many of the genes which were of unknown function then are still of unknown function now. Perhaps a lack of full investment in this endeavour is partly to blame, but it's also just very hard to do. But, in my opinion, it is a worthy task. Sometimes sequencing a genome is compared to obtaining a periodic table; can you imagine a periodic table in which the basic properties of the elements (save the ones with infinitesimal stability) weren't known?
The most notorious attempt at high throughput function assignment was the Reactome array. In this scheme, substrates were arrayed on solid supports and an enzyme presented to the array in a way that was supposed to generate a fluorescent signal if the enzyme was active on a given substrate. Chemists immediately cried "Oh, really!??" when it was published in Science, arguing that getting so many different assays all working in the same format was a task unlikely to succeed. The paper was ultimately retracted, but the method still has its believers (I had a former colleague swear recently it gave useful data).
A new paper from Uwe Sauer and colleagues (behind a paywall, but I think much of the same information can be found in the lead author's Ph.D. thesis) again tries to assay many substrates in parallel, but with a very different universal assay. Instead of arraying substrates, the paper uses cell lysates, and the enzyme's activity is monitored by searching for changes in mass spectra (MS). In a screen of the 1,275 E.coli proteins with unknown function, they report metabolome changes for 241 of these. A set of 12 were characterized in depth and assigned specific biochemical functions.
The workflow begins by pooling extracts of E.coli grown under two different conditions, in order to expand the range of represented metabolites. Metabolites were analyzed by time-of-flight mass spectrometry, yielding 4,720 distinct ions, but only 777 of these were tentatively assigned to known metabolites. This illustrates both a limitation of the method and an opportunity: it is possible to identify an ORF that catalyzes the consumption of an unassigned ion or the creation of one (or both). Such ORFs can't be tagged with a reaction, but such findings do identify target ions whose identification would yield specific benefits for understanding the E.coli biochemical network.
Enzymes were presented to these lysates either as purified proteins (using His-tag methodology), or by simply mixing the expression strain lysates with the metabolome lysate. The latter has the advantage of lower cost and complexity, as well as some additional hope of including key cofactors or partner proteins potentially lost in purification.
To validate and calibrate the assay, the authors included 189 known enzymes, although only 121 of these were purified. Changes in metabolites were observed for 34 of these 121 purified known enzymes. Interestingly, for 4 of these only unknown metabolites changed in the assay, and for 8 others metabolites not among their known reactants or products were seen to change. The authors used this data to estimate their precision as 88% and recall as 25%, but it also illustrates the possibility of finding new functions for old enzymes. A wide variety of reaction types were found, with the obvious exception of isomerizations, since the reactant and product will have the same mass.
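As a rough sanity check, here is one reading of those counts that happens to reproduce the reported precision and recall. This is my own reconstruction, not necessarily how the authors computed them: I am assuming the 4 enzymes whose only changes were in unknown ions are scored as unconfirmed hits, and the remaining 30 as true positives.

```python
# One possible reconstruction of the reported precision and recall.
# ASSUMPTION (mine, not the paper's stated method): hits that moved only
# unknown ions cannot be confirmed and so count against precision.
purified_known = 121     # known enzymes assayed as purified protein
with_signal = 34         # of those, enzymes showing any metabolite change
unknown_ions_only = 4    # of those, changes seen only in unassigned ions

true_positives = with_signal - unknown_ions_only   # 30
precision = true_positives / with_signal           # 30 / 34
recall = true_positives / purified_known           # 30 / 121

print(f"precision = {precision:.0%}, recall = {recall:.0%}")
# prints: precision = 88%, recall = 25%
```

Whatever the exact bookkeeping, the shape of the result is clear: when the assay reports a change it is usually right, but it misses about three quarters of known activities.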
For the 1,275 functionally uncharacterized proteins, 241 gave a signal in the assay. This hit rate of a bit under 1 in 5 is lower than the roughly 1 in 4 (34 of 121) seen with the known enzymes, but since many of the uncharacterized proteins may not be enzymes this is not surprising. A few ions were excluded at this stage as they were affected by a large number of proteins. All but 15 of the 241 appeared to be specific, with only a few ions shifted.
A number of informatics approaches were now used to assign possible functions. For example, annotated ions were used, sometimes with slightly relaxed cutoffs, to assign ORFs to known reactions for E.coli. This, for example, assigned YgdH as a uracil monophosphate phosphatase.
Another approach was to predict possible reactions by identifying similar molecules. The logic here is that most enzymes perform modest tweaks on their substrates, so two compounds that differ by only a tweak are candidates for a product-reactant pair. A related technique is to use the mass difference between two observed ions to predict the transformation; if two differ by the mass typical of a phosphorylation, then phosphorylation becomes a candidate reaction. For example, YgjP is homologous to nucleotide phosphatases and triggered a transition between two unknown metabolite ions consistent with dephosphorylation.
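The mass-difference idea is simple enough to sketch. The code below is my illustration, not the authors' software: the table of shifts uses standard monoisotopic masses, but the function name, the short list of transformations and the 0.003 Da tolerance are all assumptions, and it works on neutral masses, ignoring adducts and charge states that a real MS pipeline must handle.

```python
# Monoisotopic mass shifts (Da) for a few common biochemical "tweaks".
MASS_SHIFTS = {
    "phosphorylation (+HPO3)": 79.96633,
    "methylation (+CH2)": 14.01565,
    "hydroxylation (+O)": 15.99491,
    "acetylation (+C2H2O)": 42.01056,
    "hydration (+H2O)": 18.01056,
    "reduction (+H2)": 2.01565,
}


def candidate_reactions(mass_from, mass_to, tol=0.003):
    """Return (transformation, 'gain' or 'loss') pairs consistent with
    going from mass_from to mass_to. tol is a hypothetical tolerance in Da."""
    delta = mass_to - mass_from
    hits = []
    for label, shift in MASS_SHIFTS.items():
        if abs(abs(delta) - shift) <= tol:
            hits.append((label, "gain" if delta > 0 else "loss"))
    return hits


# Neutral monoisotopic masses of UMP and uridine.
ump, uridine = 324.03587, 244.06954
print(candidate_reactions(ump, uridine))
# Isomers share a mass, so no shift is detected (the blind spot noted earlier).
print(candidate_reactions(180.06339, 180.06339))
```

A UMP-to-uridine change shows up as loss of a phosphate group, which is the kind of signature that would point at a phosphatase even when neither ion has a name yet.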
To really nail this down, the authors picked 16 candidate novel enzymes for specific follow-up. These proteins were purified and subjected to pure predicted substrates and monitored over time by MS. Four of these predictions failed to pan out, most likely due to misassignment of the correct ion. After all, multiple metabolites may yield the same ion, leading to ambiguity. But the remaining 12 could be assigned to 29 specific reactions.
For another approach to validation, the authors took the 223 of the 241 hits which are viable as single-gene deletions and searched those lysates for metabolites. Take a pause here and consider: the other 18 are essential genes which were given a putative function by this assay. While metabolic models of E.coli give important insights, we clearly don't understand some of the critical biochemistry! In any case, 18 of these null mutants showed metabolite changes consistent with the in vitro protein assay. Furthermore, the authors found relevant phenotypic information for several genes, using the predicted reactions to guide the search. For example, YfbT is a hexitol dephosphorylase which contributes to butanol tolerance, an important phenotype for some attempts to produce biofuels.
When I first joined George Church's lab, he was in the process of sequencing 3% of the E.coli genome. He had deliberately picked a contiguous region nearly opposite the origin of replication, as there were few mapped genes there. Would this prove to be a wasteland?
The region we sequenced and annotated proved to be plenty gene-rich, so the ORFs piled up. If you're wondering why the genes I've cited have begun with 'y', that's a convention Kenn Rudd dreamt up which proved very useful. For each ORF, the 'y' indicates unknown function and then the next two letters give a position on the E.coli chromosome in percent (also known as minutes, based on it taking 100 minutes to conjugate an entire E.coli K12 chromosome). So YgdH is from region 'gd', which is 63 (g=6, d=3), and this was the 7th ORF of unknown function found in that region. We found so many such 'URFs' in our region that Kenn had to expand his scheme; once we ran out of Yei names we moved to Yoi, as O is the 15th letter of the alphabet. Luckily we never needed to go to Yyi names!
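For the curious, the naming scheme as I've described it can be decoded in a few lines. This is a sketch of my description above rather than an official specification: it counts letters from a=0 (so h reads as 7, matching the YgdH example), and it assumes that overflow names such as Yoi simply shift the second letter by ten per round. The function name and the returned field names are made up for illustration.

```python
import string


def decode_y_name(name):
    """Decode a four-letter 'y' gene name such as 'YgdH'.

    ASSUMPTIONS: letters count from a=0; the second letter may be shifted
    by 10 per overflow round (e -> o -> y); the third letter stays in a-j.
    """
    if len(name) != 4 or name[0].lower() != "y":
        raise ValueError(f"not a four-letter y-name: {name!r}")
    letters = string.ascii_lowercase
    second = letters.index(name[1].lower())
    third = letters.index(name[2].lower())
    if third > 9:
        raise ValueError(f"third letter outside a-j: {name!r}")
    return {
        "map_percent": 10 * (second % 10) + third,   # position in minutes
        "overflow_round": second // 10,              # 0 for Yei, 1 for Yoi
        "serial_index": letters.index(name[3].lower()),
    }


print(decode_y_name("YgdH"))
print(decode_y_name("YoiA"))   # 'YoiA' is a made-up example name
```

Under these assumptions YgdH decodes to map position 63 and YgjP to 69, while a Yoi name lands back at position 48 in the first overflow round.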
Will we ever complete the parts list for E.coli? Again, some of that is a question of investment, which often seems modest, but it will certainly require a number of different approaches. Another paper by many of the same authors is on my to-read list, and makes further inroads. This paper, as noted in some of my (and their) text, offers many loose ends, particularly with unassigned metabolite ions. There is also the possibility of increasing metabolite diversity with even more extracts, or by somehow supplementing with compounds likely to be encountered in the environment. For example, this approach would never identify beta galactosidase unless the media were supplemented with lactose.
There are also a number of ways to use the logic of bacterial operons to extend this paper. For example, one could revisit enzymes which may be multisubunit in nature. Since multiple subunits of the same enzyme will often be in the same operon, a limited number of pooled proteins or multi-protein expression constructs could be run through the assay. Also, since multiple enzymes in the same pathway are often clustered in operons, getting even a toehold with one operon member using this MS approach should open new avenues to pin down the other operon members.
A final thought. A Twitter comment on my recent piece on Gen9 (and if you haven't read CEO Kevin Munnelly's comments, you really must as they give a useful perspective) was that I had an unstated assumption that cheaper gene synthesis costs would translate to a wealth of new experiments. It is a very fair criticism that even if the cost of synthesis went to zero, many experiments would not be affected. However, a paper such as this illustrates where nearly free gene synthesis could have an impact. While the current paper relied on an existing collection of expression clones, if you wanted to repeat this for most organisms either a large PCR cloning campaign or a large gene synthesis order would be needed. Just as an estimate, at $0.05 per base and 1200 genes averaging a kilobase in size, a study of similar scale to this one would require about $60K of synthesis.
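That back-of-the-envelope number is easy to redo for other prices or genome sizes. A minimal sketch, with a made-up function name and the price expressed in whole cents so the arithmetic stays exact:

```python
def synthesis_cost_dollars(n_genes, mean_length_bp, cents_per_base):
    """Total synthesis cost in dollars; integer cents avoid float rounding."""
    return n_genes * mean_length_bp * cents_per_base / 100


# 1200 genes, about 1 kb each, at $0.05 (5 cents) per base.
print(synthesis_cost_dollars(1200, 1000, 5))   # 60000.0
```

Drop the price to a cent per base and the same screen costs $12K, which starts to look like a routine line item rather than a project in itself.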