My recent piece on the folly of moonshot projects caught the eye of a Silicon Valley blogger named Kumar Thangudu (@datarade), who gave me some nice praise. But along with that were some notes of his on other big science projects, and one he targeted was the Human Genome Project, with a sarcastic comment about the paucity of drugs coming from the HGP. Them's fighting words! So I launched a tweet stream his way outlining a number of impacts of HGP on biomedical science, and he was very nice and suggested I write more on that. I'm thinking of a number of angles on the topic, but here goes a bit of nostalgia that I think illustrates a number of points on the topic. Except I'm first going to talk about Escherichia coli, and worse, talk about it before its genome was complete. But that's part of the point.
Chemistry's Periodic Table: A Useful Analogy to a Genome Sequence
Trying to teach people outside of genomics about the huge impact of the genome project can be very difficult. Explanations often require significant technical detail and are imperfect. It's also been long enough that even most persons within biology won't remember what it was like to do genetics or biochemistry or molecular biology before the genome sequencing era (though if you work on an organism that isn't sequenced, you'll still be in that world to some degree). Analogies are always dangerous, because they can be taken too far or important details can be lost. But one imperfect analogy for a genome sequence which I think has value is the periodic table.
Now, one risk in making this analogy is that too few people really understand even the table's major principles, let alone all the amazing things that sprang from it. This is evident in most of the graphical designs titled "Periodic Table of X", in which the well-known frame of chemistry's table is filled with items of type X with absolutely no order whatsoever. This misses the major point of the table. One of the better such tables I saw was in a hidden corner of a Disney World gift shop, with a periodic table of Disney characters. One column had all ducks, with Donald on top, Daisy down one row and so forth. Another column had all mice: Mickey at the top, then Minnie and then their nephews. I think Goofy and Pluto occupied a column of dogs and dog-like creatures. This is the essence of the periodic table; both the columns and rows have meaning.
But it's better than that: the periodic table engendered a whole series of hypotheses. Perhaps the best known is that it enabled the prediction of elements which hadn't yet been discovered. Not just that elements would be found, but key quantitative properties were predicted, and usually right. This is also a boon to chemistry students; when you see sodium (Na) sitting just below lithium (Li), you know they will have very similar properties, except sodium is heavier and even more willing to shed electrons. Even more interesting (though I've forgotten the details), were cases where properties such as atomic mass were not quite in agreement with theory; these led to identifying further quirks of atomic physics. We're past much of that excitement, but a number of books (such as Oliver Sacks' Uncle Tungsten and Richard Rhodes' Making of the Atomic Bomb) capture the excitement of the time when the periodic table was a roadmap for chemical discovery.
Even more to the theme of this post, the periodic table is computable. That is, if I point to a box on the table one can use general principles and trends to compute a number of useful properties; conversely, given certain properties of an unknown element, one can estimate where it goes on the table. Sometimes these are easy computations, and sometimes hard ones. An example of an easy one, which can be taught to very young children, is the number of protons for a given box; it is a simple counting exercise. Others may require fancy quantum theory. There are also functions which are not perfectly reversible. For example, knowing the atomic mass of an isotope of a known element enables computing how many neutrons it has (subtract the atomic number from the mass), but since isotopes of different elements can have the same mass, simply knowing the mass of an atom won't let you determine precisely what element it is (though you can narrow the field). I'm sure there are also examples of trapdoor functions, much easier to compute in one direction than the other (a concept I introduced TNG to this weekend), but I'm not enough of a chemist to think of one offhand.
From Protein to Spot and Back Again
When I joined George Church's lab in the summer of 1992, E.coli had only been partially sequenced. A major effort to sequence it was slowly getting off the ground, but hadn't yet made progress. George's group had sequenced about 3% of the genome in a particularly unexplored region, which turned out to have few genes of known function (and a quarter century later, sadly, few more worked out). In anticipation of the genome being completed, he and graduate student Andy Link had engaged in an ambitious proteomics project (though I think the word wasn't yet in use).
Imagine all the proteins in a cell such as E.coli; even at that time it was known there would be hundreds if not thousands of them. You'd like to identify them all, and identify how they changed with time and conditions. One approach in 1992 for this problem was a technique called 2D electrophoresis. In this, you first use clever conditions to sort proteins within a gel by the pH at which the protein's charges are all balanced (the isoelectric point, or pI). Then that gel is placed against another gel and run in a perpendicular direction under conditions in which the proteins sort by size (molecular weight, or MW). In the ideal case each spot (visualized with a stain) on the final gel would be a single protein. Doing this on a broad scale in E.coli was pioneered by Frederick Neidhardt, who died this month.
But now you have a lot of spots. You can tell if spots go away or appear or change intensity with changes in conditions, but what are these spots? Well, if you know the sequence of a protein, you can predict the MW and pI. Weight is really easy, but I botched my first attempt at calculating pI. Foolishly, I tried the clever approach of using Newton's method to find the inflection point of the charge vs. pH curve, whereas the simpler (and successful) solution recognizes that you don't need much precision, so simply computing charge at 0.1 pH unit steps from pH 1 to pH 10 will find the pI very quickly and safely.
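The brute-force scan described above is simple enough to sketch in a few lines. This is a minimal illustration, not the code we actually used: the pKa values are illustrative assumptions (published tables differ slightly), and the names `net_charge` and `estimate_pi` are made up for this example.

```python
# Illustrative pKa values (an assumption; published tables differ slightly).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}             # side chains that carry + charge
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}    # side chains that carry - charge
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(seq, ph):
    """Approximate net charge of a protein sequence at a given pH."""
    # Free amino terminus contributes positive charge, free carboxyl terminus negative.
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for aa in seq:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def estimate_pi(seq):
    """Scan pH 1.0 to 10.0 in 0.1 steps; return the pH with the smallest |charge|."""
    grid = [i / 10 for i in range(10, 101)]
    return min(grid, key=lambda ph: abs(net_charge(seq, ph)))


print(estimate_pi("MISSAMANDA"))
```

Because the scan stops at pH 10, a very basic protein will simply report the top of the range, which is good enough for placing a spot on a gel.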
Ahh, but what if the sequence is wrong? What if the protein has been clipped or modified in some way? Then the spot won't be in the "right" place! But where will it be?
What Andy did, and he did a lot of it, was isolate the spots and then subject them to automated Edman sequencing. The development of this instrument is described a bit in Luke Timmerman's Lee Hood biography. Edman sequencing uses a cycle of chemical reactions to identify the first amino acid in the protein, then the second and third and so on. Since it is working on an ensemble of molecules with imperfectly complete reactions, in each cycle the signal grows more noisy. Many labs doing this would try to push for really long sequences, but George and Andy realized that for their purpose short ones would usually do -- and were faster and cheaper to boot.
So, imagine a spot comes out as M,I,S,S,A,M,A,N,D,A. That's about enough, if it is perfect, to search through the known E.coli proteins and try to find a match - such as to the shiH gene. Now, suppose the genome wasn't complete and your sequence didn't hit anything. If you absolutely needed to know, you could back-compute the DNA sequence that would encode that peptide and use it as one primer in a PCR scheme to capture the gene. However, this is a good example of imprecise computation: because the genetic code is degenerate, there are multiple options. The sequence above has two Ms, which is lucky since M has only 1 codon (ATG), but brutally it has two Ss, each of which is encoded by 6 codons. E.coli doesn't use all its codons equally, so that can bias the primer design, and wobble positions can be filled with mixes of nucleotides, but strictly speaking there are 1x3x6x6x4x1x4x2x2x4=27,648 possible primer sequences (one could pick a substring to make life easier; the final 5 gets it down to only 64 possibilities). Ordering those primers costs time and money, and the experiment is not guaranteed to work the first time (or at all), and might only get part of your gene of interest, requiring a new round of primer design to get the rest.
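The degeneracy arithmetic above is just a product of codon counts, one factor per residue. A small sketch, using the codon counts of the standard genetic code (the function name `primer_degeneracy` is hypothetical, chosen for this example):

```python
from math import prod

# Number of codons per amino acid in the standard genetic code.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def primer_degeneracy(peptide):
    """Count the distinct DNA sequences that could encode this peptide."""
    return prod(CODON_COUNTS[aa] for aa in peptide)


print(primer_degeneracy("MISSAMANDA"))  # 27648, matching the product in the text
```

The same function can be run over any substring of the peptide, such as the last five residues, to hunt for the least degenerate stretch to prime from.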
George counseled the patient approach: wait for the genome to be sequenced. Then the peptide sequence need only be matched against a lookup table of predicted proteins or, if you prefer, searched directly against the 6-frame translations of the entire DNA sequence (using a program such as TBLASTN). Comparing against the predicted proteins is particularly useful, since that is a test on the predictions; did they find the protein and nail the N-terminus? This can be tricky, as bacteria can use a number of possible start codons (generally some choice out of CTG, TTG and GTG atop the canonical ATG), plus there can be processing of a protein at the N-terminus. For example, several signals for sorting a protein to the correct location are N-terminal and can be removed once the protein is at its permanent home. Note that in the case of significant processing, the protein almost certainly will not be at the spot on the gel which would be predicted from the raw sequence.
Now, sometimes a more complex result would come out of the Edman sequencer: (M,I),(I,S),S,(A,S),(A,M),(A,M),(A,N),(D,N),(D,A),(D,A). This indicates that the spot wasn't pure; there was more than one protein in the sample. At many positions, two (sometimes three or more) amino acids come out, but there is no way to tell which go together. Or is there? If we assume two proteins present, there are 256 possible pairs of sequences there; good luck designing PCR primers for all that! But given the genome and predicted proteins, one can convert this to a regular expression and search the database. I've constructed that particular mixture to reflect a common occurrence: one is the original sequence and one is that sequence minus the starting N-formyl methionine (the special methionine used to start proteins). Heterogeneity in removing that first N-formyl methionine is common. We saw that pattern a lot and learned to look for it. But we also saw plenty of cases where two completely different proteins were overlaid.
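Turning mixed Edman calls into a regular expression takes very little code. The sketch below is only an illustration: the two-entry "proteome" is invented toy data (the sequence given for shiH is made up to fit the example), and the helper names are hypothetical. A lookahead is used so that overlapping matches, such as the same protein with and without its first methionine, are both reported.

```python
import re

# One string per Edman cycle; each string lists the residues called in that cycle.
calls = ["MI", "IS", "S", "AS", "AM", "AM", "AN", "DN", "DA", "DA"]


def calls_to_regex(calls):
    """Build a character-class pattern, e.g. [MI][IS]S[AS]..."""
    return "".join(c if len(c) == 1 else f"[{c}]" for c in calls)


def find_hits(proteome, calls):
    """Return (protein name, start offset, matched text) for every match."""
    # Wrapping the pattern in a lookahead lets overlapping matches be found.
    pattern = re.compile(f"(?=({calls_to_regex(calls)}))")
    hits = []
    for name, seq in proteome.items():
        for m in pattern.finditer(seq):
            hits.append((name, m.start(), m.group(1)))
    return hits


# Toy predicted proteome (hypothetical sequences, for illustration only).
toy_proteome = {
    "shiH": "MISSAMANDADKLTPEQ",
    "toyB": "MKTAYIAKQRQISFVK",
}

for hit in find_hits(toy_proteome, calls):
    print(hit)
```

Two hits in the same protein at offsets 0 and 1 are the signature of the with-and-without-methionine mixture; hits in two different proteins would instead suggest overlaid spots.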
If you want to read more details, we published the work in Electrophoresis, though unfortunately the article is paywalled.
Edman's Successor: Protein Mass Spectrometry
Laboratory strains of E.coli have on the order of 5,000 raw proteins; humans have a base capacity for four times that, and then post-translational modifications (which often change the location) are more common in eukaryotes than bacteria. The odds of getting perfect separation on a 2D gel go down with greater complexity, and in any case 2D gels are very labor intensive and finicky. The Edman sequencing gets only the N-terminus; if you really want to check everything out one would need to break the protein into multiple pieces, separate those and sequence each one. That's more tedium.
A new class of technology arose to dominate proteomics, and that technology is protein mass spectrometry. Mass spectrometry is one of the innovations from the time when the periodic table was incompletely filled and is truly amazing: these machines weigh ions (charged atoms and molecules). It took a while (about three-quarters of a century) to go from weighing atoms and small molecules to weighing entire proteins, but that technology is now very established. Two basic schemes dominate in protein work. Quadrupole mass spectrometers shoot the molecule towards a target and then apply a bending force with a magnet; since f=ma, the mass of the molecule can be computed from the degree of deflection of its path (but the force is proportional to the charge on the molecule, so what is actually measured is the ratio of mass:charge aka m/z). Also important here is the fact that one can steer ions with magnetic fields.
I've actually used a basic mass spec for a basic process. When I was a high school intern at Penn, an ultrahigh vacuum system was clearly leaking after we had changed out some equipment. The instrument had a simple mass spec built in. I was assigned to take a Pasteur pipet linked to a helium tank and patiently run the pipet tip around each suspect area, and look for a spike in the helium mass on the instrument. Which revealed, sadly, that the high school intern had used the wrong sized bolt in reassembling a flange, just pinching a metal diaphragm.
Anyway, back to proteins. Proteins are big things and many proteins will have the same raw weight, so how do you tell them apart? First, one can use chromatography upstream to try to fractionate a messy biological sample into bins. Multiple dimensions of chromatography can be used in much the same manner as the two axes of the 2D gel. Chromatography can be automated, which is a huge plus, and there are more dimensions available which can be used in series, but the more dimensions, the more samples created to analyze and therefore more instrument time required. But for most interesting samples, chromatography alone will not be sufficient to get to a single protein, and even if it does, multiple proteins can have the same molecular weight.
One other pre-processing step which is common in the mass spec is to digest the input, often after some chromatography, into peptides using a protease. Trypsin is heavily used, which cuts after most R (arginine) and K (lysine) residues, both of which bear a positive charge on their side chain. Small peptides are more amenable to mass spectrometry than entire proteins, but now you have a complex mixture. Chromatography on that mixture can reduce the complexity, but again trades sample complexity for number of fractions.
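Given a predicted protein sequence, the digestion step can be simulated on a computer. Here is a minimal sketch under a simplifying assumption: trypsin cuts after every K or R except when the next residue is proline (the usual textbook rule; real digests also have missed cleavages). The function name `tryptic_digest` and the example sequence are made up for illustration.

```python
def tryptic_digest(seq):
    """Split a protein into peptides, cutting after K or R unless followed by P."""
    peptides = []
    start = 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # Whatever follows the last cleavage site is the C-terminal peptide.
        peptides.append(seq[start:])
    return peptides


print(tryptic_digest("MKTAYRPLKGG"))  # ['MK', 'TAYRPLK', 'GG']
```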
Let's suppose that all this has been wonderfully successful, and each fraction queued to go into the mass spec bears at most one peptide. How do you identify what that peptide is and what it means?
A big trick in the mass spec field is to use multiple stages of mass specs, with collision cells in between. One can use the first stage to select a given mass range and steer that into a space filled with a gas at very low pressure. The ions collide with the gas molecules, leading to breaking up of the input molecule into fragments (computing these, I understand, is not quite a fully solved problem, particularly for large, complex molecules). These can then be analyzed for size, producing a fingerprint of the original mass. If the instrument is designed for it and enough ions are around, as some are lost to noise at each step, another round of steering-collision-analysis can fingerprint each of the fragments.
Now a number of methods have been developed to take these fragment patterns to infer the possible protein sequence that produced them. One complication is that some amino acids weigh the same or nearly so (leucine and isoleucine are isomers, with exactly the same number of each atom). Furthermore, amino acids may be modified by either damage or post-translational enzymes. Some spectrometers are crazy accurate, but the problem can easily be under-specified: many possible amino acid sequences could yield a known mass. Understanding cleavage patterns helps immensely here, but that involves more complicated instruments and computations. Remember also in all this that one has multiple peptides, so if a single protein was digested then guesses for each peptide can be cross-matched to find the common protein that could produce all of them.
Or, if you possess a complete predicted proteome of an organism, one can match the observed spectral data to the predicted spectral data for the organism. Less experimental time is required to generate the data and a simpler computation is involved: close to a lookup.
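To make the "close to a lookup" point concrete, here is a deliberately simplified sketch that matches a single observed peptide mass against an in-silico tryptic digest of a predicted proteome. Real search engines compare whole fragment spectra, so treat this as an assumption-laden toy: the proteome is invented, the helper names are hypothetical, the digest ignores missed cleavages, and the tolerance is arbitrary. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_peptides(seq):
    """Simplified trypsin rule: cut after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def build_index(proteome):
    """Precompute (mass, protein name, peptide) for every predicted peptide."""
    index = []
    for name, seq in proteome.items():
        for pep in tryptic_peptides(seq):
            index.append((peptide_mass(pep), name, pep))
    index.sort()
    return index


def lookup(index, observed_mass, tol=0.01):
    """Return every predicted peptide within tol daltons of the observed mass."""
    return [(name, pep) for mass, name, pep in index
            if abs(mass - observed_mass) <= tol]


# Toy predicted proteome (hypothetical sequences, for illustration only).
toy_proteome = {"protA": "MISSAMANDAKGLEVR", "protB": "MKTAYIAKQR"}
index = build_index(toy_proteome)
print(lookup(index, peptide_mass("GLEVR")))
```

Note that leucine and isoleucine share a mass in the table, so a mass match alone cannot tell them apart; the predicted proteome is what resolves the ambiguity.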
I hope through this that I've illustrated one aspect of why having a genome sequence so changes the way experiments are run. The genome becomes a central artifact, much like the periodic table in chemistry, around which much else is organized. The sequence provides the way to make predictions and also enables simpler lookup-style operations to replace difficult computations to interpret high-throughput experiments (in this case, protein mass spectrometry). Because such experiments are likely to yield a position on a map, rather than a new experimental headache, new experiments can more quickly be interpreted in the light of previous findings. The design-experiment-interpret cycle becomes significantly faster.
I'm thinking about writing several more pieces on this theme of how the genome revolutionized biology. Not sure how many, but I do have several roughed out in my head. One along these lines that is in my archives was my passionate rant on the success of cancer genomics, which is often a target for big science critics.
This also shows how getting a genome sequence is so different from Apollo in terms of impact. Apollo was a great engineering project (as a recent essay put it, we should celebrate Apollo and condemn Apolloism) and yielded great insights about the history of the moon, but it didn't change the way science was done. There was no radical restructuring of the way anyone thought about science post-Apollo. It didn't yield a set of organizing principles for space science. But the human genome sequence (and any other reference genome sequence) did. We don't divide science into pre-genome and post-genome because it is handy from a historian's perspective; it's a natural division due to a tectonic shift in the practice of science.