Wednesday, November 29, 2006

Phage Renaissance

Bacteriophage, or phage, occupy an exalted place in the history of modern biology. Hershey & Chase used phage to nail down DNA (and not protein) as the genetic material. Benzer pushed genetic mapping to the nucleotide level. And much, much more. Phage could be made in huge numbers, to scan for rare events. Great stuff, and even better that so many of the classic papers are freely available online!

Phage have also been great toolkits for molecular biology. First, various enzymes were purified, many still in use today. Later, whole pieces of phage machinery were borrowed to move DNA segments around.

Two of the best studied phage are T7 and lambda. Both have a lot of great history, and both have recently undergone very interesting makeovers.

T7 is a lytic phage; after infection it simply starts multiplying and soon lyses (breaks open) its host. T7 provided an interesting early computational conundrum, one which I believe is still unsolved. Tom Schneider has an elegant theory about information and molecular biology, which can be summarized as: locational codes contain only as much information as they need to be located uniquely in a genome, no more, no less. Testing on a number of promoters suggested the theory was valid. However, a sore thumb stuck out: T7 promoters contain far more information than the theory called for, and a clever early artificial evolution approach showed that this information really wasn't needed by T7 RNA polymerase. So why is there more conservation than 'necessary'? It's still a mystery.
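Schneider's information measure can be sketched in a few lines: sum, across the positions of an aligned set of binding sites, the difference between the 2-bit maximum and the observed entropy of each column. The alignment below is a made-up toy, and the small-sample correction used in the real method is omitted:

```python
import math

def information_content(sites):
    """R_sequence in the style of Schneider: bits of information in an
    alignment of equal-length sites (small-sample correction omitted)."""
    length = len(sites[0])
    total = 0.0
    for i in range(length):
        column = [s[i] for s in sites]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        total += 2.0 - entropy  # 2 bits is the maximum for a 4-letter alphabet
    return total

# Hypothetical alignment: perfectly conserved positions carry the full 2 bits each
sites = ["TAATACGACT"] * 8
print(information_content(sites))  # 20.0 bits for 10 fully conserved positions
```

A column where all four bases appear equally often contributes zero bits; the puzzle with T7 promoters is that the total measured this way exceeds what is needed to pick out the promoters uniquely in the genome.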

Phage lambda follows a very different lifestyle. After infection, most times it goes under deep cover, embedding itself at a single location in its E. coli host's genome, a state called lysogeny. But when the going gets tight, the phage get going and go through a lytic phase much like that of T7. The molecular circuitry responsible for this bistable system was one of the first complex genetic systems elucidated in detail. Mark Ptashne's book on this, A Genetic Switch, should be part of the Western canon -- if you haven't read it, go do so! (Amazon link)

With classical molecular biology techniques, modest tinkering or wholesale vandalism were the only really practical ways to play with a phage genome. You could rewrite a little or delete a lot. Despite that, it is possible to do a lot with these approaches. In today's PNAS preprint section (alas, you'll need a subscription to get beyond the abstract) is a paper which re-engineers the classic lambda switch machinery. The two key repressors, CI and Cro, are replaced with two other well-studied repressors whose activity can be controlled chemically, LacI and TetR. Appropriate operator sites for these repressors were installed in the correct places. In theory, the new circuit should perform the same lytic-lysogeny switch as lambda phage 1.0, except now under the control of tetracycline (TetR, replacing CI) and lactose (LacI, replacing Cro). Of course, things don't always turn out as planned.
These variants grew lytically and formed stable lysogens. Lysogens underwent prophage induction upon addition of a ligand that weakens binding by the Tet repressor. Strikingly, however, addition of a ligand that weakens binding by Lac repressor also induced lysogens. This finding indicates that Lac repressor was present in the lysogens and was necessary for stable lysogeny. Therefore, these isolates had an altered wiring diagram from that of lambda.
When theory fails to predict, new science lies ahead!

Even better, with the advent of cheap synthesis of short DNA fragments ("oligos") and new methods of putting those together, the possibility of printing "all the phage that's fit to print" is really here. This new field of "synthetic biology" offers all sorts of new experimental options, and of course a new set of potential misuses. Disclosure: my next posting might be with one such company.

Such rewrites are starting to show up. Last year one team reported rewriting T7. Why rewrite? A key challenge in trying to dissect the functions of viral genes is that many viral genes overlap. Such genetic compression is common in small genomes, and gets more impressive the smaller the genome. But, if tinkering with one gene also tweaks one or more of its neighbors, interpreting the results becomes very hard. So by rewriting the whole genome to eliminate overlaps, cleaner functional analysis should be possible.

With genome editing becoming a reality, perhaps it's time to start writing a genetic version of Strunk & White :-)

Monday, November 27, 2006

Graphical table-of-contents

I am a serious journal junkie, and have been for some time. As an undergraduate I discovered where the new issues of each key journal (Cell, Science, Nature, PNAS) could be first found. In grad school, several of us had a healthy competition to first pluck the new issue from our advisor's mailbox -- and the number of key journals kept going up. Eventually, of course, all the journals went on-line and it became a new ritual of hitting the preprint sites at the appropriate time -- for example, just after 12 noon on Thursdays for the Cell journals. A good chunk of my week is organized around the rituals.

Most pre-print sites, indeed most on-line tables-of-contents, are barebones text affairs. That's fine and dandy with me -- quick & easy to skim. But, I do appreciate a few that have gone colorful. Some now feature a key figure from each article, or perhaps a figure collage specifically created for display (much like a cover image, but one for each paper).

Journal of Proteome Research is at the forefront of this trend. Of course, since it is a pre-print site the particular images will change over time. As I write this, I can see a schematic human fetus in utero, flow charts, Venn diagrams, spectra, a dartboard (!), bananas, 1D & 2D gels, a grossly overdone pie chart, and much more.

Nature Chemical Biology is the other journal I am aware of with this practice. The current view isn't quite such a riot, because NCB doesn't have the large set of pre-prints that JPR has, but both a fly and a worm are gracing the page.

The graphical views do provide another hint of what might be in the paper beyond the title. In particular, they give some feel for what the tone of the paper might be (that dartboard must indicate a bit of humor!). They certainly add some color to the day.

Sunday, November 26, 2006

Gene Patents

Today's Parade magazine has an article titled "How Gene Patents are Putting Your Health at Risk". The topic of gene patents deserves public scrutiny & debate, but also better coverage than this article provides.

Featured prominently (with a picture in the print edition) is Michael Crichton, whose new book has been touched on previously in this space. Crichton in particular makes a number of concrete statements, some of which are a bit dubious.

First, let's take the statement
A fifth of your genes belong to someone else. That’s because the U.S. Patent Office has given various labs, companies and universities the rights to 20% of the genes found in everyone’s DNA— with some disturbing results.
The first sentence is just plain wrong, and given its inflammatory nature that is very poor journalism. Nobody can own your genes -- genes, as natural entities, are not themselves patentable. What can be patented are uses of the information in those genes, not the genes themselves. That is a critical, subtle distinction which is too often lost -- just as I could patent a novel use for water, but not water itself.

Time for the full disclosure: I am a sole or co-inventor on 11 issued gene patents (e.g. U.S. Patent 6,989,363), many of which are for the same gene, ACE2. Many more gene patents were applied for on my behalf, but most have already been abandoned as not worth the investment. Those patents attempted to make a wide range of claims, but interestingly they missed what may be the key importance for ACE2 (we never guessed it), which is that it is a critical receptor for the SARS virus.

Many of the gene patents do illustrate a key shortcoming of current patent law. When filing a gene patent, we (and all the other companies) tried to claim all sorts of uses for the information in the gene. These sorts of laundry lists are the equivalent of being able to buy as many lottery tickets as you like for free. A rational system would penalize multiple claims, just as multiple testing is penalized in experimental designs. The patent office should also demand significant evidence for each claim (they may well do this now; I am no expert on the current patent law).

Another of Crichton's claims deserves at least some supporting evidence, and it also confuses two distinct concepts in intellectual property law:
Plus, Crichton says, in the race to patent genes and get rich, researchers are claiming they don’t have to report deaths from genetic studies, calling them “trade secrets.”

First, just because some idiots have the chutzpah to make such claims doesn't mean they are believed or enforceable. Second, such claims have nothing to do with gene patents -- such claims could exist in any medical field. Finally, trade secrets and patents are two different beasts altogether. In a patent, the government agrees to give you a monopoly on some invention in return for you disclosing that invention so others may try to improve on it; a trade secret must be kept secret to retain protection and should someone else discover the method by legal means, your protection is shot.

The on-line version also includes a proposed "Genetic Bill of Rights". I would propose that before enacting such a bill, one think very carefully about the ramifications of some of the proposals.

Take, for example,
Your genes should not be used in research without your consent, even if your tissue sample has been made anonymous.
What exactly does this mean? What it will probably mostly mean is that the thicket of consent hurdles around tissue samples will get thicker. Does this really protect individual privacy more, or is it simply an impediment which will deter valuable research? Will it somehow put genetic testing of stored samples on a different footing than other testing (e.g. proteomic), in a way which is purely arbitrary?

Another 'right' proposed is
Your genes should not be patented.
First, an odd choice of verb? "Should"? Isn't that a bit mousy? Does that really change anything? And what, exactly, does it mean to patent "your genes"?

On the flip side, I'm no fan of unrestricted gene patenting. All patents should be precise and have definite bounds. They should also be based on good science. Patents around the BRCA (breast cancer) genes are the most notorious, both because they have been extensively challenged (particularly in Europe) and because the patent holders have been aggressive in defending them. This has led to the strange situation in (at least part of) Europe where the patent coverage on testing for breast cancer susceptibility depends on what heritage you declare: the patent applies only to testing in Ashkenazi Jews.

In a similar vein, I can find some agreement with Crichton when he states
During the SARS epidemic, he says, some researchers hesitated to study the virus because three groups claimed to own its genome.
It is tempting to give non-profit researchers a lot of leeway around patents. However, the risk is that some such researchers will deliberately push the envelope between running research studies and running cut-rate genetic testing shops. Careless changes to the law could also hurt companies selling patented technologies used in research: if a researcher can ignore patents for genetic tests, why not for any other patented technology?

Gene patents, like all patents, are an attempt by government (with a concept enshrined in the U.S. Constitution) to encourage innovation yet also enable further progress. There should be a constant debate as to how to achieve this. Ideas such as 'bills of rights', research exemptions, the definitions of obviousness and prior art, and many other topics need to be hashed over. But please, please, think carefully before throwing a huge stone, or volley of gravel, into the pool of intellectual property law.

Wednesday, November 22, 2006

Enjoy Your W!

No, this isn't revealing my politics. W is the single letter code for tryptophan, which is reputedly richly found in turkey meat (If you are a vegetarian, what vegetable matter is richest in W?)

Tryptophan is an odd amino acid, and probably the last one added to the code -- after all, in most genomes it is encoded by only a single codon, and it is one of the rarest amino acids in proteins. It has a complex ring system (indole), which would also suggest it might have come last.

Why bother? Good question -- one I should know the answer to but don't. When W is conserved, is the chemistry of the indole ring being utilized where nothing else would do? That's my guess, but I'll need to put that on the list of questions to figure out sometime.

But why W? Well, there are 20 amino acids translated into most proteins (plus a few others in special cases) and they were named before anyone thought their shorthand would be very useful. There are clearly mnemonic three letter codes, but for long sequences a single letter works better -- once you are indoctrinated in them, the one letter codes become nice compact representations which can be scanned by eye. Some have even led to names of domains, such as WD40, PEST and RGD.

The first choice is to use the first letter of the name: Alanine, Cysteine, Glycine, Histidine, Isoleucine, Leucine, Methionine, Proline, Serine, Threonine and Valine follow this rule. If multiple amino acids start with the same letter, the smallest amino acid gets to take the letter. Some others are phonetically correct: aRginine, tYrosine, F:Phenylalanine. Others just fill in D:aspartate, E:glutamate, K:lysine, N: asparagine and Q:glutamine.
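The resulting mapping, collected in one place (a simple sketch; the `to_one_letter` helper is just for illustration):

```python
# The 20 standard amino acids: three-letter to one-letter codes, following
# the rules described above (first letter where unambiguous, phonetic
# otherwise, leftovers filled in)
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(residues):
    """Convert a list of three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[aa] for aa in residues)

# The Trp-Asp dipeptide that gives the WD40 domain its name, twice over
print(to_one_letter(["Trp", "Asp"] * 2))  # 'WDWD'
```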

But Tryptophan? Perhaps it was studied early on by Dr. Fudd, who gave long lectures about the wonders of tWiptophan.

What's a good 1Gbase to sequence?

My newest Nature arrived & has on the front a card touting Roche Applied Science's 1Gbase grant program. Submit an entry by December 8th (1000 words), and you (if you reside in the U.S. or Canada) might be able to get 1Gbase of free sequencing on a 454 machine. This can be run on varying numbers of samples (see the description). They are guaranteeing 200bp per read. The system runs 200K reads per plate and the grant is for 10 plates -- 2M reads -- but 2M x 200 bases = 400Mb -- so somewhere either I can't do math or their materials aren't quite right. The 200bp/read is a minimum, so apparently their average is quite a bit higher (or again, I forgot a factor somewhere). Hmm, paired end sequencing is available but not required, so that isn't the obvious factor of 2.
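The arithmetic in question, spelled out (using only the figures quoted from Roche's materials):

```python
# Back-of-the-envelope check of the 1 Gbase claim against the quoted specs
reads_per_plate = 200_000
plates = 10
min_read_length = 200  # bp, the guaranteed minimum per read

total_reads = reads_per_plate * plates
guaranteed_bases = total_reads * min_read_length
print(total_reads, guaranteed_bases)  # 2,000,000 reads -> 400 Mb guaranteed

# For 2M reads to actually deliver 1 Gbase, the average read would need
# to run well past the guaranteed minimum:
required_avg = 1_000_000_000 / total_reads
print(required_avg)  # 500 bp per read
```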

So what would you do with that firepower? I'm a bit embarrassed that I'm having a hard time thinking of good uses. For better-or-worse, I was extended at Millennium until the end-of-year, so any brainstorms around cancer genomics can't be surfaced here. There are a few science-fair like ideas (yikes! will kids soon be sequencing genomes as science fair projects?), such as running metagenomics on slices out of a Winogradsky column. When I was an undergrad, our autoclaved solutions of MgCl2 always turned a pale green -- my professor said this was routine & due to some alga that could live in that minimal world. Should that be sequenced? Metagenomics of my septic system? What is the most interesting genetic system that isn't yet undergoing a full genome scan?

Well, submission is free -- so please submit! Such an opportunity shouldn't be passed up lightly.

Thursday, November 16, 2006

One Company I'm Not Sending My Resume To

On some site today a Google ad caught my eye -- alas I cannot remember why -- and I found a site for NEXTgencode. However, it doesn't take long to realize that this isn't a real company. The standard links for a real biotech company, such as 'Careers', 'Investors', etc. are missing.

Clicking around what is there leads to a virtual Weekly World News of imaginative fabrications, though some have previously been presented as true by places that should know better, such as the BBC. I think I saw an item on grolars (grizzly-polar bear hybrids) in the press as well. Also thrown in is a reference to the recently published work: "Humans and Chimps Interbred Until Recently".

Various ads show products in development. My favorite ad is the one for Perma Puppies, which never grow old or even lose their puppy physique (though their puppy isn't nearly as cute as mine was!). There's also the gene to buy with the HUGO symbol BLSHt.

The giveaway is the last news article, which describes a legal action by the company
Michael Crichton's book "Next" claims to be fiction, but its story line reveals proprietary information of Nextgencode, a gene manipulation company.

Surprise! "Next" will be released at the end of the month.

When I read Andromeda Strain as a kid, I fell hook, line & sinker for a similar ploy in that book -- all of the photos were labeled just like the photos of real spacecraft in books on NASA ("Photo Courtesy of Project SCOOP"). It took some convincing from older & wiser siblings before I caught on.

Going back over the news items with the knowledge of who is behind it was revealing. Crichton has become noted for throwing his lot in with global warming skeptics. "Burn Fuel? Backside Fat Powers Boat" is the tamer of the digs; another item suggests Neanderthals were displaced by the Cro-Magnon due to the Neanderthals' environmentalist tendencies.

Well, at least it's a tame fake -- a fake company purely to hawk a book. Sure beats the shameless hucksters who set up companies to peddle fake cures (we have stem cell injections to cure hypochondria!) to desperate patients.

Wednesday, November 15, 2006

Counting mRNAs

If you want to measure many mRNAs in a cell, microarray technologies are by far the winner. But for more careful scrutiny of the expression of a small number of genes, quantitative RT-PCR is the way to go. qRT-PCR is viewed as more consistent & has higher throughput (for lower cost) when looking at the number of samples which can be surveyed. It doesn't hurt that one specific qRT-PCR technology was branded TaqMan, which plays on both the source of the key PCR enzyme (Thermus aquaticus aka Taq) and the key role of Taq polymerase's exonuclease activity, which munches nucleotides in a manner reminiscent of a certain video game character (though I've never heard of any reagent kits being branded 'Power Pills'!).

RT-PCR quantitation relies on watching the time course of amplification. Many variables can play with amplification efficiencies, including buffer composition, primer sequence, and temperature variations. As a result, noise is introduced and results between assays are not easily comparable.

The PNAS site has an interesting paper which uses a different paradigm for RT-PCR quantitation. Instead of trying to monitor amplification dynamics, it relies on a digital assay. The sample is diluted and then aliquoted into many amplification chambers. At the dilutions used, only a fraction of the aliquots will contain a single template molecule. By counting the number of chambers positive for amplification & working back from the dilution, the number of template molecules in the original sample can be estimated.

Such digital PCR is very hot right now and lies at the heart of many next generation DNA sequencing instruments. What makes this paper particularly interesting is that the assay has been reduced to microfluidic chip format. A dozen diluted samples are loaded on the chip, which then aliquots each sample into 1200 individual chambers. Thermocycling the entire chip drives the PCR, and the number of positive wells are counted. While the estimate is best if most chambers are empty of template (because then very few started with multiple templates), the authors show good measurement at higher (but non-saturating) template concentrations.
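Working back from the positive-chamber count is a Poisson calculation: if templates land in chambers at random, the fraction of empty chambers is e^(-λ), so the mean occupancy λ can be recovered from the positive fraction even when some chambers received more than one molecule. A minimal sketch (the chamber counts below are hypothetical, loosely echoing the 1200-chamber panels described):

```python
import math

def estimate_templates(positive, total_chambers, dilution_factor=1.0):
    """Estimate template molecules loaded into a digital PCR panel.

    Chamber occupancy follows a Poisson distribution, so the mean
    templates per chamber is lambda = -ln(1 - fraction_positive).
    Naively counting positive chambers undercounts whenever some
    chambers hold more than one molecule.
    """
    p = positive / total_chambers
    lam = -math.log(1.0 - p)
    return lam * total_chambers * dilution_factor

# 300 of 1200 chambers light up: a naive count says 300 templates,
# but the Poisson-corrected estimate is noticeably higher
print(round(estimate_templates(300, 1200)))  # 345
```

At very low occupancy the correction is negligible (the naive count is nearly right), which is why the estimate is simplest when most chambers are empty; the correction is what lets the assay keep working at higher, non-saturating concentrations.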

An additional layer of neato is here as well -- each sample is derived from a single cell, separated from its mates by flow sorting. While single cell sensitivity has been achieved previously, the new paper claims greater measurement consistency. By viewing individual cells, misunderstandings created by looking at populations are avoided. For example, suppose genes A and B were mutually exclusive in their expression -- but a population contained equal quantities of A-expressors and B-expressors. For a conventional expression analysis, one would just see equal amounts of A and B. By looking at single cells, the exclusive relationship would become apparent. The data in this paper show examples of wide mRNA expression ranges for the same gene in the 'same' type of cells; a typical profile of the cell population would see only the weighted mean value.
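The mutually exclusive A/B scenario is easy to simulate, and shows how a bulk average hides structure that single-cell counts reveal (the cell counts and expression levels below are invented for illustration):

```python
import random

random.seed(0)

# Hypothetical population: half the cells express only gene A,
# half express only gene B (mutually exclusive expression)
cells = ([{"A": random.gauss(100, 10), "B": 0.0} for _ in range(500)] +
         [{"A": 0.0, "B": random.gauss(100, 10)} for _ in range(500)])

# A bulk (population) measurement averages everything together...
bulk_A = sum(c["A"] for c in cells) / len(cells)
bulk_B = sum(c["B"] for c in cells) / len(cells)
print(f"bulk: A={bulk_A:.0f}, B={bulk_B:.0f}")  # both look moderately expressed

# ...while per-cell counts reveal that no cell expresses both
coexpressing = sum(1 for c in cells if c["A"] > 0 and c["B"] > 0)
print("cells expressing both A and B:", coexpressing)  # 0
```

The bulk numbers suggest two genes at similar, middling levels; only the single-cell view exposes the all-or-nothing pattern underneath, just as in the paper's wide per-cell expression ranges.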

The digital approach is very attractive since it is counting molecules. Hence, elaborate normalization schemes are largely unnecessary (though the Reverse Transcriptase step may introduce noise). Furthermore, from a modeler's perspective actual counts are gold. Rather than having fold-change information with fuzzy estimates of baseline values, this assay is actually enumerating mRNAs. Comparing the expression of two genes becomes transparent and straightforward. Ultimately, such measurements can become fodder for modeling other processes, such as estimating protein molecule per cell counts.

Cell sorters can also be built on chips (this is just one architecture; many others can be found in the Related Articles for that reference). It doesn't take much to imagine marrying the two technologies to build a compact instrument capable of going from messy clinical samples to qRT-PCR results. Such a marriage might one day put single cell qRT-PCR clinical tests into a doctor's office near you.

Tuesday, November 14, 2006

More lupus news

Hot on my previous rant around lupus is some more news. Human Genome Sciences has announced positive results for its Lymphostat B drug in lupus. I won't go into detail on their results, other than to comment that the study size is large (>300), the trial is a Phase II double-blind placebo controlled trial (open label, single arm trials are much more common for Phase II -- HGS isn't taking the easy route) but these results haven't yet been subject to full peer review in a journal article.

Lymphostat B has a number of unusual historical notes attached to it. It is in that very rarefied society of discoveries from genomics which have made it far into therapeutic clinical trials -- there are other examples (not on hand, but trust me on this!), but not many. It doesn't hurt that it was in a protein family (TNF ligands) which suggested a bit of the biology (e.g. a cognate receptor) & has led to quite a bit of biology which is in the right neighborhood for a lupus therapy (B-cell biology); most genomics finds were Churchillian enigmas.

Second, this is a drug that initially failed similar trials -- but HGS conducted a post-hoc subset analysis on the previous trial. However, instead of begging their way forward (such analyses get all the respect due used cat litter, but that doesn't stop desperate companies from trying to argue for advancement) they designed a new trial using a biomarker to subset the population. If their strategy works, it is likely that doctors will only prescribe it to this restricted population. HGS has, in effect, decided it is better to treat some percent of a small population than risk getting approval for 0% of a larger one -- a bit of math the pharmaceutical industry has frequently naysayed.

HGS and their partner GSK still have a long way to go on Lymphostat B. Good luck to them -- everyone in this business needs it, especially the patients.

Monday, November 13, 2006

Small results, big press release

The medical world is full of horrible diseases which need tackling, but you can't track them all. For me, it is natural to focus a touch more on those to which I have a personal connection.

Lupus is one such disease, as I have a friend with it. Lupus is an autoimmune disease in which the body produces antibodies targeting various normal cellular proteins. The result can be brutal biological chaos.

The pharmaceutical armamentarium for lupus isn't very good. Anti-lupus therapies fall into two general categories: anti-inflammatory agents and low doses of cancer chemotherapeutics (primarily anti-metabolite therapies such as methotrexate). Few of these have been adequately tested in lupus, and certainly not well tested in combination. The docs are flying by the seat of their pants. The side effects of the drugs are quite severe, so much so that lupus therapy can be an endless back-and-forth between minimizing disease damage & therapy side effects.

One reason lupus hasn't received a lot of attention from the pharmaceutical industry is that we really don't understand the disease. It is almost certainly a 'complex disease', meaning there are multiple genetic pathways that lead to or influence the disease. Different patients manifest the disease in different ways. For many patients, the most dangerous aspect is an autoimmune assault on the kidneys, but for my friend the most vicious flare-ups are pericarditis, an inflammation of the sac around the heart. These differences could reflect very different disease mechanisms; we really don't know.

We need to understand the mechanisms of lupus, so it is with interest I read items such as this one: New biomarkers for lupus found. The item starts promisingly

A Wake Forest University School of Medicine team believes it has found biomarkers for lupus that also may play a role in causing the disease.

The biomarkers are micro-ribonucleic acids (micro-RNAs), said Nilamadhab Mishra, M.D. He and colleagues reported at the American College of Rheumatology meeting in Washington that they had found profound differences in the expression of micro-RNAs...

So far, so good -- except now things go south
...between five lupus patients and six healthy control patients who did not have lupus.

Five patients? Six controls? These are exquisitely tiny samples, particularly when looking at microRNAs, of which there are >100 known in humans. With so few samples, the risk of a chance association is high. And are these good comparisons? Were the samples well matched for age, concurrent & previous therapies, gender, etc.?

Farther down is even more worrisome verbiage
In the new study, the researchers found 40 microRNAs in which the difference in expression between the lupus patients and the controls was more than 1.5 times, and focused on five micro-RNAs where the lupus patients had more than three times the amount of the microRNAs as healthy controls, and one, called miR 95 where the lupus patients had just one third of the gene expression of the microRNA of the controls.

Fold-change cutoffs are popular in expression studies, because they are intuitive, but are generally meaningless. Depending on how tight the assays are, fold changes of 3X can be meaningless (in an assay with high technical variance) and ones smaller than 1.5X can be quite significant (in an assay with very tight technical variance). Well-designed microarray studies are far more likely to use proper statistical tests, such as T-tests.
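A toy comparison makes the point. Using Welch's two-sample t statistic, a 3x fold change from a noisy assay can be statistically unremarkable while a 1.4x change from a tight assay is overwhelming (all the measurements below are fabricated for illustration):

```python
import math
import statistics

def welch_t(xs, ys):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)  # sample variances
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

# ~3x fold change, but with sloppy technical variance
noisy_cases    = [3.1, 9.5, 0.8, 4.2, 12.0]   # mean ~5.9, huge spread
noisy_controls = [2.0, 0.5, 6.1, 1.2, 0.4]    # mean ~2.0

# ~1.4x fold change, but with very tight technical variance
tight_cases    = [1.41, 1.39, 1.42, 1.40, 1.38]
tight_controls = [1.01, 0.99, 1.00, 1.02, 0.98]

print(f"3x change, noisy assay:   t = {welch_t(noisy_cases, noisy_controls):.2f}")
print(f"1.4x change, tight assay: t = {welch_t(tight_cases, tight_controls):.2f}")
```

The larger fold change yields a t statistic well within chance territory, while the smaller one produces an enormous t -- exactly why a fixed fold-change cutoff, divorced from variance, is a poor filter.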

And one last statement to complain about
The team reported the lesser amount of miR 95 "results in aberrant gene expression in lupus patients."

Is this simply correlation between miR 95 and other gene expression -- which suffers both from the fact that correlation is not causation and from the fact that with such small samples gene expression differences will be found by pure chance? Are these genes which have previously been shown to be targets of miR 95? Has it been shown that actually interfering with miR 95 expression in the patient samples reverts the gene expression changes?

Of course, it is patently unfair for me to beat up on a scientific poster of preliminary results for which I have only seen a press release - one hopes that before this data gets to press a much more detailed workup is performed (please, please let me review this paper!). But, it is also patently unfair to yank the chains of patients with understudied diseases with press releases that take a nub of a preliminary result and headline it into a major advance.

Friday, November 10, 2006

Systems Biology Review Set

The Nature publishing group has made a set of reviews on systems biology available for free. It looks like an interesting set of discussions by key contributors.

Hairy Business

In addition to the sea urchin genome papers, the new Science also contains an article describing the positional cloning of a mutant gene resulting in hair loss. The gene encodes an enzyme which is now presumed to play a critical role in the health of hair follicles.

The first round of genomics companies had two basic scientific strategies. Companies such as Incyte and Human Genome Sciences planned to sequence the expressed genes & somehow sift out the good stuff. Another set of companies, such as Millennium, Sequana, Myriad and Mercator planned to find important genes through positional cloning. Positional cloning uses either carefully collected human family samples or carefully bred mice to identify regions of the genome that track with the trait of interest. By progressively refining the resolution of the genetic maps, the work could narrow down the region to something that could be sequenced. Further arduous screening of the genes in that region for mutations which tracked with the trait would eventually nail down the gene. Prior to the human genome sequence this was a long & difficult process, and sometimes in the end not all the ambiguity could be squeezed out. It is still serious work, but the full human genome sequence and tools such as gene mapping chips make things much cheaper & easier.

It seemed like every one of the positional cloning companies picked new indications -- obesity, diabetes, depression, schizophrenia, etc. -- and generally the same ones. This set up heated rivalries to collect families, find genes, submit patents & publish papers. Sequana & Millennium were locking horns frequently when I first showed up at the latter. If memory serves, on the hotly contested genes it was pretty much a draw -- each sometimes beating the other to the prize.

Eventually, all of the positional cloning companies discovered that while they could achieve scientific success, it wasn't easy to convert that science into medical reality. Most of the cloned genes turned out to be not easily recognizable in terms of their function, and certainly not members of the elite fraternity of proteins known as 'druggable targets' -- the types of proteins the pharmaceutical industry has had success at creating small molecules (e.g. pills) to target. A few of the genes found were candidates for protein replacement therapy -- the strategy which has made Genzyme very rich -- but these were rare. Off-hand, I can't think of a therapeutic arising from one of these corporate positional cloning efforts that even made it to trials (anyone know if this is correct?).

Before long, the positional cloning companies either moved into ESTs & beyond (as Millennium did) or disappeared through mergers or even just shutting down.

I'm reminded of all this by the Science paper because hair loss was one area that wasn't targeted by these companies -- although the grapevine said that every one of them considered it. The commercial success of Rogaine made it an attractive area commercially, and there was certainly a suggestion of a strong genetic component.

If a company had pursued the route that led to the Science paper, it probably would have been one more commercial disappointment. While the gene encodes an enzyme (druggable), the hairless version is a loss-of-function mutant -- and small molecules targeting enzymes reduce their function. The protein isn't an obvious candidate for replacement therapy either. So, no quick fix. The results will certainly lead to a better understanding of what makes hair grow, but only after lots of work tying this gene into a larger pathway.

As for me, I'm hoping I inherited my hair genes from my maternal grandfather, who had quite a bit on his head even into his 90's, rather than my father's side, where nature is not quite so generous. As for urchins, I learned to avoid them after a close encounter on my honeymoon. I was lucky the hotel had a staff doctor, but I discovered on my return to the States that we had missed one spine & perhaps I still carry a little product of the sea urchin genome around in my leg.

Thursday, November 09, 2006

Betty Crocker Genomics

It is one thing to eagerly follow new technologies and muse about their differences; it is quite another to be in the position of playing the game with real money. In the genome gold rush years it was decided we needed more computing power to deal with searching the torrent of DNA sequence data, and so we started looking at the three then-extant providers of specialized sequence analysis computers. But how to pick one, with each costing as much as a house?

So, I designed a bake-off: a careful evaluation of the three machines. Since I was busy with other projects, I attempted to define strict protocols which each company would follow with their own instrument. The results would be delivered to me within a set timeframe along with pre-specified summary information. Based on this, I would decide which machines met the minimal standard and which was the best value.

Designing good rules for a bake-off is a difficult task. You really need to understand your problem, as you want the rules to ensure that you get the comparative data you need in all the areas you need it, with no ambiguity. You also want to avoid wasting time drafting or evaluating criteria that aren't important to your mission. Of key importance is to not unfairly prejudice the competition against any particular entry or technology -- every rule must address the business goal, the whole business goal, and nothing but the business goal.

Our bake-off was a success, and we did purchase a specialized computer which sounded like a jet engine when it ran (fans! it ran hot!) -- but it was tucked away where nobody would routinely hear it. The machine worked well until we no longer needed it, and then we retired it -- and not long after the manufacturer retired from the scene, presumably because most of their customers had followed the same path as us.

I'm thinking about this now because a prize competition has been announced for DNA sequencing, the Archon X Prize. This is the same organization which successfully spurred the private development of a sub-orbital space vehicle, SpaceShip One. For the Genome X Prize, the basic goal is to sequence 100 diploid human genomes in 10 days for $1 million.

A recent GenomeWeb article described some of the early thoughts about the rules for this grand, public bake-off. The challenges in simply defining the rules are immense, and one can reasonably ask how they will shape the technologies which are used.

First off, what exactly does it mean to sequence 100 human genomes in 10 days for $1 million? Do you have to actually assemble the data in that time frame, or is that just the time to generate raw reads & filter them for variations? Can I run the sequencer in a developing country, where the labor & real estate costs are low? Does the capital cost of the machine count in the calculation? What happens to the small gaps in the current genome? To mistakes in the current assembly? To structural polymorphisms? Are all errors weighted equally, and what level is tolerable? Does every single repeat need to be sequenced correctly?

The precise laying down of rules will significantly affect which technologies will have a good chance. Requiring that repeats be finished completely, for example, would tend to favor long read lengths. On the other hand, very high basepair accuracy standards might favor other technologies. Cost calculation methods can be subject to dispute (e.g. this letter from George Church's group).

One can also ask the question as to whether fully sequencing 100 genomes is the correct goal. For example, one might argue that sequencing all of the coding regions from a normal human cell will get most of the information at lower cost. Perhaps the goal should be to sequence the complete transcriptomes from 1000 individuals. Perhaps the metagenomics of human tumors is what we really should be shooting for -- with appropriate goals for extreme sensitivity.

Despite all these issues, one can only applaud the attempt. After all, Consumer Reports does not review genomics technologies! With luck, the Genome X Prize will spur a new round of investment in genomics technologies and new companies and applications. Which reminds me, if anyone has Virgin Galactic tickets they don't plan to use, I'd be happy to take them off your hands...

Urchin Genome

The new Science has a paper reporting the sequence of a sea urchin genome, as well as articles looking at specific aspects. This is an important genome, since it is the first echinoderm sequenced and echinoderms share many key developmental aspects with vertebrates.

At ~860 Mb the urchin genome is a bit larger than the Fugu (pufferfish) genomes which have been sequenced but substantially smaller than mammalian genomes (generally around 3,000 Mb). With fast, cheap sequencing power on the horizon, soon all our favorite developmental models will have their genomes revealed.

Tuesday, November 07, 2006

Long enough to cover the subject, short enough to be interesting.

That was the advice my 10th grade English teacher passed on when asked how much we should produce for a writing assignment. The context (a woman's skirt) he gave was risqué enough to get a giggle from 10th graders of the 80's; probably the same joke would get him in serious hot water today -- unless perhaps he pointed out that the same applies for a man's kilt.

A letter in a recent Nature suggests that the same question that vexed me in my student days also bedevils the informatics world. The writer lodges a complaint against MIAME (Minimum Information About a Microarray Experiment), a standard for reporting the experimental context of a microarray experiment. MIAME attempts to capture some key information, such as what the samples are and what was done to them.

The letter writer's complaint is that this is all a fool's errand, as one cannot possibly capture all the key information, especially since what is key to record keeps changing. All reasonable points.

The solution proposed made me re-read the letter for a hint of satire, but I'm afraid they are dead serious:
"How should we proceed? Reducing the costs of microarray technology so that experiments can be readily reproduced across laboratories seems a reasonable approach. Relying on minimal standards of annotation such as MIAME seems unreasonable, and should be abandoned."

At first, this just seems like good science. After all, the acid test in science is replication by an independent laboratory.

This utterly ignores two facts. First, depositing annotated data in central databanks lets it be mined by researchers who don't have access to microarray gear. Second, most interesting microarray experiments involve either specialized manipulations (which only a few labs can do) or very precious limited samples (such as clinical ones); replication would be nice but just can't be done on those same samples.

This "the experiments will be too cheap to database" argument has come up before; I had it sent my way during a seminar in my graduate days. But, like electricity too cheap to meter, it is a tantalizing mirage which fades on close inspection.

Monday, November 06, 2006

From biochemical models to biochemical discovery


An initial goal of genome sequencing efforts was to discover the parts lists for various key living organisms. A new paper in PNAS now shows how far we've come in figuring out how those parts go together, and in particular how discrepancies between prediction & reality can lead to new discoveries.

E.coli has been fully sequenced for almost 10 years now, but we still don't know what all the genes do. A first step would be to see if we can explain all known E.coli biology in terms of genes of known function -- if we can, that would say the remaining genes are either for biology we don't know or for fine-tuning the system beyond the resolution of our models. But if we can't, that says there are cellular activities we know about but haven't yet mapped to genes.

This is precisely the approach taken in Reed et al. First, they have a lot of data as to which conditions E.coli will grow on, thanks to a common assay system called Biolog (a PDF of the metabolic plate layout can be found on the Biolog website -- though curiously marked "Confidential -- do not circulate"!). They also have a quantitative metabolic model of E.coli. Marry the two and some media that support growth cannot be explained -- in other words, E.coli is living on nutrients it "shouldn't" according to the model.

Such a list of unexplained activities is a set of assays for finding the missing parts of the model, and deletion strains of E.coli provide the route to discovering which genes plug the gaps. If a given deletion strain fails to grow in one of the unexplained growth-supporting media, then the gene deleted in that strain is probably the missing link. The list of genes to test can be kept small by choosing based on the model -- if the model is missing a transport activity, then the initial efforts can focus on genes predicted to encode transporters. Similarly, if the model is missing an enzymatic reaction one can prioritize possible enzymes. The haul in this paper was to assign functions to 8 more genes -- a nice step forward.
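The logic of that gap-filling loop can be sketched in a few lines of Python. To be clear, this is only an illustration of the strategy: the media, the candidate gene names, and the knockout results below are invented for the sketch, not taken from Reed et al.

```python
# Sketch of the model-gap strategy, with made-up data.
# observed_growth: media E. coli actually grows on (Biolog-style assay).
# model_growth: media the metabolic model predicts growth on.
observed_growth = {"glucose", "succinate", "galactitol", "L-galactonate"}
model_growth = {"glucose", "succinate"}

# Growth the model can't explain points to missing reactions/transporters.
unexplained = observed_growth - model_growth

# Hypothetical deletion-strain screen: for each unexplained medium, test
# model-prioritized candidates (e.g. predicted transporters). A knockout
# that abolishes growth fingers its gene as the likely missing link.
knockout_grows = {
    ("geneA", "galactitol"): False,      # deletion kills growth -> candidate
    ("geneB", "galactitol"): True,       # deletion still grows -> not it
    ("geneC", "L-galactonate"): False,   # deletion kills growth -> candidate
}

assignments = {
    medium: gene
    for (gene, medium), grows in knockout_grows.items()
    if medium in unexplained and not grows
}
print(assignments)  # -> {'galactitol': 'geneA', 'L-galactonate': 'geneC'}
```

The real paper of course works with a quantitative flux model rather than a growth/no-growth set difference, but the prioritize-then-knockout loop is the same shape.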

It is sobering how much of each sequenced genome is of unknown function, even in very compact genomes. Integration of experiment and model, as illustrated in this paper, is our best hope for closing that gap.

Friday, November 03, 2006


Protein phosphorylation is a hot topic in signal transduction research. Kinases can add phosphate groups to serines, threonines & tyrosines (and very rarely histidines), and phosphatases can take them off. These phosphorylations can shift the shape of the protein directly, or create (or destroy) binding sites for other proteins. Such bindings can in turn cause the assembly/disassembly of protein complexes, trigger the transport of a protein to another part of the cell, or lead to the protein being destroyed (or prevent such) by the proteasome. This is hardly a comprehensive list of what can happen.

Furthermore, a large fraction (by some estimates 1/5 to 1/4) of the pharmaceutical industry's efforts, including those at my (soon to be ex-) employer Millennium, target protein kinases. If you wish to drug kinases, you really want to know the downstream biology, and that starts with what your kinase phosphorylates, when it does so, and what events those phosphorylations trigger.

A large number of methods have been published for finding phosphorylation sites on proteins, but by far the most productive have been mass spectrometric ones (MS for short). Using various sample workup strategies, cleverer-and-cleverer instrument designs, and better software, the MS folks keep pushing the envelope in an impressive manner.

The latest Cell has the latest leap forward: a paper describing 6,600 phosphorylation sites (on 2,244 proteins). To put this in perspective, the total number of previously published human phosphorylation sites (by my count) was around 12,000 -- this paper has found 50% as many as were previously known! Some prior papers (such as these two examples) had found close to 2,000 sites.

Now some of this depth came from many MS runs -- but that in itself illustrates how this task is getting simpler; otherwise so many runs wouldn't be practical. The multiple runs also were used to gather more data: looking at phosphorylation changes (quantitatively!) over a timecourse.

One thing this study wasn't designed to do is clearly assign the sites to kinases. Bioinformatic methods can be used to make guesses, but without some really painful work you can't make a strong case. And if the site doesn't look like any pattern for a known kinase -- good luck! There really aren't great methods for solving this (not to say there aren't some really clever tries).

Also interesting in this study is the low degree of overlap with previous studies. While the reference set they used is probably quite a bit smaller than the 12K estimate I give, it is still quite large -- and most sites in the new paper weren't found in the older ones. There are in excess of 20 million Ser/Thr/Tyr residues in the proteome; many are probably never phosphorylated, but a reasonable estimate is that north of 20K are.
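For what it's worth, the overlap itself can be turned into a crude total via a capture-recapture (Lincoln-Petersen) estimate -- under the strong (and surely violated) assumption that the two site collections sample the phosphoproteome independently. The overlap number below is invented for illustration; only the two study sizes come from the post.

```python
# Lincoln-Petersen estimate of total phosphosites from two "captures".
# n_previous and n_new are from the post; the overlap is hypothetical,
# since real studies sample sites far from independently.
n_previous = 12_000   # previously published sites (my rough count)
n_new = 6_600         # sites reported in the new Cell paper
overlap = 2_000       # hypothetical: sites found in both

estimated_total = n_previous * n_new / overlap
print(round(estimated_total))  # -> 39600
```

The smaller the true overlap, the larger the implied total -- which is why a low degree of overlap between big studies argues for a phosphoproteome well north of what's been catalogued.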

For drug discovery, the sort of timecourse data in this paper is another proof-of-concept of the idea of discovering biomarkers for your kinase using high-throughput MS approaches (another case can be found in another paper). By pushing for so many sites, the number of candidates goes up substantially, even though many of the sites found aren't modulated in an interesting way, at least in terms of pursuing a biomarker. This is noted in Figure 3 -- for the same protein, the temporal dynamics of phosphorylation at different sites can be quite different.

However, it remains to be seen how far into the process these MS approaches can be pushed. Most likely, the sites of interest will need to be probed with immunologic assays, as previously discussed.

Thursday, November 02, 2006

Metagenomics backlash.

Metagenomics is a burgeoning field enabled by cheap sequencing firepower -- which grows cheaper each year. You take some interesting microbial ecosystem (such as your mouth, a septic tank, or the Sargasso Sea), perform some minimal prep, and sequence everything in the soup. The results find everything in the sample, not just what you can culture in a dish.

Now in Nature we can see the backlash -- angry microbiologists irked at uneducated oafs stomping their turf. One complaint (scroll to the bottom) is the oft-used term "unculturable species" -- i.e. the new stuff that metagenomics discovers. Quite appropriately, the microbiologists cry foul on this aspersion against their abilities, as the beasties aren't unculturable -- they just haven't been cultured yet.

The new letter says 'Amen' and goes on to gripe that sequencing unknown microbes is no way to properly discover biological diversity; only culturing them will do.

IMHO, a lot of this is the usual result of new disciplines with eager, arrogant new members (moi?) wading into the domain of old disciplines. According to my microbiology teaching assistant, a molecular biologist is defined as "someone who doesn't understand the biological organism they are working with". Similar issues of "hey, who's muscling in on my turf?" beset chemistry, as illustrated in this item from Derek Lowe's excellent medicinal chemistry blog.

These sorts of spats have some value but aren't terribly fun to watch. Worse, the smoke & dust from them can obscure the real common ground. There is already at least one example of using genome sequence data to guide culture medium design. Perhaps future metagenomic microbiologists will make this standard practice.

Wednesday, November 01, 2006

In vivo nanobodies.

The new print issue of Nature Methods showed up, and it is rare for this journal not to have a cool technology or two in it. If you are in the life sciences, you can generally get a free subscription to this journal.

Antibodies are cool things, but also complex molecular structures. They are huge proteins composed of 2 heavy chains & 2 light chains, all held together by disulfide linkages. Expressing recombinant antibodies is no small feat -- given the precise chain ratio & folding required, it is very hard to do. Trying to express them inside the cytoplasm would be even trickier, as the redox potential won't let those disulfides form.

Camels & their kin, however, have very funky antibodies -- ones with only a single heavy chain. I've never come across the history of how these were found -- presumably some immunologist sampling all mammals to look for weird antibodies. Because of this structure, they are much smaller & don't require disulfide linkages. In fact, the constant parts of the camelid antibody can be lopped off as well, leaving the very small variable region, termed a nanobody.

The new paper (subscription required for full text, alas) describes fusing nanobodies to fluorescent proteins & then expressing them in vivo. Since only a single chain is needed, the nanobody coding region can be PCRed out & fused to your favorite fluorescent protein. The paper shows that when expressed in cells, these hybrids glow just where you would expect them to. The ultimate vital stain for any protein or modification! With multiple fluorescent proteins of different colors, multiplexing is even theoretically possible (though not approached in this paper).

Of course, one is going to need to generate all those nanobodies. There is already a company planning to commercialize therapeutic nanobodies (ablynx). Perhaps another company will specialize in research tool nanobodies -- ideally without the nanoprofits and nanoshareprices which are all too common in biotechnology!