Friday, March 30, 2007

Trying to Gulp Through A Coffee Stirrer

I'm not a big coffee drinker -- indeed I come from a long line of not big coffee drinkers -- but I do often raid various coffee establishments for other goodies, and in the cold weather (perhaps behind us for a while) that includes hot cocoa.

Of course, greedily drinking a hot beverage can be the route to a host of unpleasant effects. Sometimes a straw makes sense, but many such places just have the little hollow plastic coffee stirrers. They might faintly resemble straws, but trying to use one that way is mostly an exercise in frustration, as the throughput just isn't sufficient.

I had a similar feeling today at work. It had occurred to me that the data I wanted to explore was probably out there on the internet somewhere -- and indeed it is, in a dozen or so 'boutique' databases. One I found had a nice prominent download link, and in short order the whole dataset was on my machine & parsed into the form I needed.

However, no such luck with any other database. They all have decent web front ends, but the last thing I want to do is browse the data one record at a time. I'm not even sure the precise data I want is in the database -- how many records should I browse before giving up? And since what I want is an atypical small subset of the data, it isn't surprising the web interface really doesn't support my query.

Anyone who curates data & makes it available deserves applause, and I hate to sound ungrateful. But could you please make a flatfile dump available? Someone might just want to use your data in a way you didn't imagine.

Thursday, March 29, 2007

454? How Roche!

Today's GenomeWeb bears the news that Roche Diagnostics is buying out 454 Life Sciences. Since Roche was previously the sole distributor of 454's sequencers and Curagen had announced their desire to sell the subsidiary, this is hardly a shocking development. But it is the third next generation sequencing company to be bought by an established player -- ABI slurped up Agencourt Personal Genomics and Illumina recently bought Solexa. So far, Affymetrix and Agilent have stayed out -- as has Nimblegen. There are plenty of other startup next generation sequencing shops out there, and certainly other candidates for acquirers. Roche, of course, got the clear current front runner, though it may be that the next wave of sequencer launches will close the gap quickly.

Whether these acquisitions are good for next generation sequencer development is an open question. On the one hand, these larger organizations bring deep pockets and substantial marketing expertise. But, there are plenty of pitfalls. For both ABI and Illumina, the new machines compete with their old machines -- smart companies see this as inevitable, but many companies completely botch the job due to internal conflicts (as amply documented by Clayton Christensen in his books). It isn't encouraging that the Agencourt Personal Genomics technology is impossible to find on the ABI website.

It will also be interesting to see how long the 454 moniker lasts -- one hates to see pioneers go, but on the other hand I find naming a subsidiary after the accounting code tres gauche.

An interesting note in the GW item is that Roche was previously prohibited from marketing regulated diagnostics built on the 454 platform. Roche has previously tried to launch some molecular diagnostics -- the D word is after all in their name -- so this is a clear fit. On the other hand, a run on the 454 is reputed to be serious money, so they'll need to either find a very high value application (in a field notorious for antiquated, miserly reimbursement rules) or figure out a way to run lots of tests simultaneously. Given the rather long read lengths of the 454, one approach to the latter would be to use sequence tags near the beginning of the read to identify the original samples.
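
To make the tagging idea concrete, here is a minimal sketch in Perl (my daily tool) of the analysis side: pooled reads carry a short known barcode at their start and get sorted back to their samples afterwards. The barcodes, reads, and sample names below are all invented, and a real multiplexing scheme would also have to cope with sequencing errors in the tag, which this ignores.

use strict;
use warnings;

# map each (made-up) 4-base tag to its (made-up) sample
my %sample_of = ( 'ACGT' => 'sample_1', 'TGCA' => 'sample_2' );
my %reads_by_sample;

while (my $read = <DATA>) {
  chomp $read;
  my $tag    = substr($read, 0, 4);                        # the tag is the first 4 bases
  my $sample = $sample_of{$tag} || 'unassigned';
  push @{ $reads_by_sample{$sample} }, substr($read, 4);   # strip the tag, keep the rest
}

for my $sample (sort keys %reads_by_sample) {
  printf "%s: %d reads\n", $sample, scalar @{ $reads_by_sample{$sample} };
}

__DATA__
ACGTGATTACAGATTACA
TGCACCCGGGAAATTTCC
ACGTTTGACCAAGGTTAA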

Another GW item describes some roundtable discussion at a recent meeting on next generation sequencing. The price for a genome in 2010 is still a big question, but a lot of bets are apparently in the $10K-$25K range. Some of the leaders in the field are taking a realistic view of the utility of such sequencers at such a price tag -- if you can scan the most informative SNPs for $1K, then why sequence? I'm guessing that other than a few pioneers (J. Craig is apparently resequencing his genome), there won't be a lot of takers at those prices. On the other hand, cancer genomics is a natural fit, as each genome is different (indeed, each sample probably has many distinguishable genomes) and understanding all the fine molecular details will be valuable. SNP chips can estimate copy numbers, but not tell you how those pieces are stitched together nor find all the interesting mutations.

Even with the price at $1K, sequencing will certainly not be 'too cheap to meter'. Notions of sequencing a big chunk of the human population have appeal, but do we really want to blow another few billion dollars on human sequencing? On the other hand, as I've suggested before, other mammalian genomes may provide a lot of interesting biology for the buck (or bark). What are the most interesting unbagged genomes out there -- that sounds like the topic for another day's post...

Wednesday, March 28, 2007

Tiny Tug-of-War

I never got past introductory physics in college, particularly since I put off taking it until my senior year. Many of the basic concepts had shown up every few years in grade school, but college really tied it together & for a brief time I could run the equations in my sleep. Lots of the problems involve springs and pulleys and other simple mechanical gadgets.

Last week's Nature contains a paper which is in the growing field of doing experiments on springs and other simple machines -- except here the gadgets are biomolecular machines, in this case the E.coli ribosome. The technologies are a bit fancier than what we had in grade school -- optical traps and such -- but in the end the desired measurements are similar -- what force does it take to balance (or overcome) a force within the molecular machine.

A huge book on my father's bookshelf was The Handbook of Chemistry and Physics, which had all sorts of tables of useful measured values and derived constants. I have the 1947 edition on my bookshelf, a present from one of my father's friends -- though I confess I've never done more than flip the pages of either one. There must be an electronic, molecular biological equivalent out there with the sort of data from this paper, but I don't know where it is. I could use it periodically -- I was recently trying to find rate and accuracy figures for RNA polymerase and the ribosome, and it isn't easy to do & I'm not sure I trust what I've found.

Thursday, March 22, 2007

Error Will Robison

After my post on the mental challenge of juggling multiple programming languages, I realized another reason I like to stick to a few: grokking the error messages.

In an ideal world the various messages kicked out from ill-formed or ill-performing code would always precisely and instantly finger the exact problem -- in which case the programming environment would just go fix them. Some programming languages do try to assist you a lot. For example, Perl often guesses that you really didn't mean to have a quoted string run over many lines, and thereby shows you where the long quote starts. Similarly, it will often suggest a semicolon addition that might cure the problem.

But the error messages aren't always on the mark -- sometimes the errant quote is really a bit before where Perl points, or a missing semicolon is not the problem. Worse, when the Perl interpreter chokes on a program, it generally spits out one of two error messages: Out of Memory or Segmentation Fault. In either case one goes looking for an inadvertent infinite loop or endless recursion (the Perl debugger catches these quite often with a more informative message). One common Perl trap is the one-letter deletion which converts nicely behaving code such as:

while (/([A-Z])/g) { push(@array,$1); }

into

while (/([A-Z])/) { push(@array,$1); }

Without the /g modifier the match never advances through the string, so the loop spins forever, pushing the same capture onto @array until memory runs out.


Another gotcha (or should I say, gotme) is the loop whose end condition can never be met:

for ($i=10; $i>0; $i++) { print "$i\n"; }

Again, the symptom tends to be running out of memory on a trivial task.

My problems don't tend to call for much recursion, so when I get 100 levels in it must be a mistake. Most commonly it is due to a botched lazy initialization -- a scheme by which a complex object doesn't set up internal state until it is asked for. I do this a lot right now, as I have objects representing complex data collections stored in a relational database, and it doesn't make sense time-wise to slurp every last piece of data from the database when you only want a few pieces. However, one must be careful:


sub getId
{
  my ($this)=@_;
  unless (defined $this->{'foo'})
  {
    $this->createFoo();   # lazily build foo -- but building foo needs the id...
  }
  return $this->{'foo'}{'id'};
}

sub createFoo
{
  my ($this)=@_;
  my $id=$this->getId();  # ...round-and-round we go!
  $this->{'foo'}={ 'id'=>$id };
}


My errors with R tend to fall into a small number of categories, and the error messages are generally informative. Out of memory means I really did blow out memory. The rest are mostly trivial syntax errors (mistyping the assignment <- as <= or =), passing NULLs (or no data) to something which doesn't care for them, and the like.

On the other hand, I'm glad that I don't do a lot of Oracle (SQL) programming, or at least a wide variety of it, because the error messages there are as clear as mud to me. Luckily, there is a small number of mistakes I make; probably 95% fall into: misspelling a table name or alias, misspelling a column name, letting Perl-isms slip in ($column), missing commas, and extraneous commas. The only one that is SQL-specific is botching GROUP BY columns and functions. The only runtime errors I tend to get are either minor hiccups from database inconsistencies or queries that never seem to return because of a botched join.

It looks like I might have a real need to learn C#, which means reverting about a decade (to when I used C++). Learning the language is one thing; learning the hidden language of error messages always takes a lot longer.

Monday, March 19, 2007

Personalized Medicine: The long slog

Personalized medicine is a wonderful concept: instead of lumping huge groups of patients with similar symptoms together to be treated with a standard regimen, therapy would be tailored to each patient based on the specifics of their disease. This fine-grained diagnosis would be determined using the fruits of the human genome project.

In some sense this is simply an attempt to accelerate the long-term trend in medicine of subdividing diseases. From four humors we have moved to a myriad of diseases. In a more specific sense, consider leukemia. In the 1940's, when my paternal grandmother succumbed to this disease, there were (as far as I can tell) fewer than a half dozen recognized leukemia subtypes; these days there are certainly over one hundred. This is not idle splitting; each disease has its own diagnostic hallmarks, treatment strategies, and outcome expectations. Great (but not universal) success has been achieved with childhood leukemias, whereas some other leukemias are still very grim sentences.

To realize the dream of personalized medicine is going to require a lot of hard work, both in the lab and in the clinic. I'm going to go into some detail on one such endeavor, one which I am very familiar with because I was peripherally involved with it. Now, in the interest of full disclosure, it must be stated that I still retain a small financial interest in my former employer, Millennium Pharmaceuticals, and that several of the authors are good friends. However, it should also be pointed out that while Millennium once trumpeted every baby step towards personalized medicine, the electronic publication of this story engendered no press release. If the company thinks it can't perk up its share price with the story, there is faint reason to think I can.

Multiple myeloma is a malignancy of the antibody secreting cells, the plasma B cells. Two famous victims are the columnist Ann Landers and actor Peter Boyle; a well-known long-term survivor is former vice presidential candidate Geraldine Ferraro. Cancers are often loosely broken into two categories: "liquid" tumors such as leukemias and solid tumors. Myelomas occupy the mushy middle: while they are derangements of the immune system like leukemias, myelomas can form distinct tumors (plasmacytomas) in the body. A hallmark of the disease is bone destruction around the tumors; patients' X-rays can have a 'swiss-cheese' appearance.

Myelomas are a devastating disease, but also occupy an important place in biotech history. Because myelomas sprout from a single deranged antibody-secreting cell, the blood (and ultimately urine) of patients becomes full of a single antibody, the M-protein (also known historically as a Bence-Jones protein). A flash of inspiration led Koehler & Milstein to realize that if they could have that antibody be one of their choosing, then a limitless source of a specific antibody could be at hand. The monoclonal antibody technology which they invented led to a host of useful reagents and tools, including home pregnancy kits. The last decade has finally seen monoclonal antibodies become important therapeutic options, particularly in cancer, and a number are being tried on myeloma: a complete circle.

The drug of interest here is not an antibody but rather a small molecule: bortezomib, tradename Velcade and known in the older literature as MLN341, LDP341 or PS341. Bortezomib works like no other drug on the market: it blocks the action of a large complex called the proteasome. A key normal function of the proteasome is to serve as the cell's main protein disposal system, chewing old or broken proteins back into amino acids. Destruction of proteins by the proteasome can also be a regulated process and appears to be a component of many cellular regulatory processes.

Bortezomib has been tried as a therapeutic agent, either alone or in concert with other drugs, against a wide array of tumors. It has disappointed often, still tantalizes in some areas, and has received FDA approval for two malignancies: multiple myeloma and another B-cell malignancy called mantle cell lymphoma.

Early in the clinical trial process Millennium decided to build a personalized medicine component into the main Velcade trials in multiple myeloma. The justification for this was a mix of different ideas, including a desire to show results in personalized medicine, the potential to use the personalized medicine element to support FDA approval should trial results be equivocal, and an opportunity to understand why myelomas are sensitive to proteasome inhibition.

The design was both simple and audacious: in each trial patients would be asked to supply a bone marrow biopsy for analysis by RNA profiling, which can examine the levels of each gene's mRNA. It sounds simple; in practice this would use a cutting edge technology (RNA profiling) notorious for sensitivity to sample processing. It would also be the first use of such technology in a prospective clinical trial; prior publications had either used archived samples or new samples from available patient populations. Protocols would have to be devised and staff trained at each clinical center in a multi-center trial.

The results can now be seen in Blood as Mulligan et al. You will need paid access to the journal to read the details, which most large academic libraries should have. Also, the publisher of Blood (the American Society of Hematology) has some mechanism for patient access -- and eventually (I think it is 6 months) they make everything free. The data supplement and methods supplement are free.

Table 1 gives you some hint as to why few companies will be eager to invest in this kind of study again, as it details how many samples actually made it to the analysis. One can envision the path from trial to data ready to analyze as a pipeline of many steps, each of which is leaky. Patients must consent, the myeloma fraction be purified, RNA captured, arrays analyzed, and finally useful survival data obtained. Patient consent refusals (or later paperwork deficiencies), poor samples, patients lost to follow-up, etc. eat into the starting material. Even good clinical luck can be problematic: one of the key bortezomib trials was halted early because the drug was clearly working better than the control drug. This was great news for patients, who needed (and still need) more treatment options, and great news for the company, which could more quickly obtain approval to sell the drug. But it both deprived the personalized medicine study of anticipated patients and muddied the waters on many others. For example, samples had been obtained from control arm patients, but now many of these patients were crossing over to bortezomib and were no longer useful controls.

How leaky was the pipeline? Four clinical studies had RNA profiling components (another complication; each study was on a different trial population, with different disease characteristics). Looking at evaluable survival (meaning the patient stayed in the study long enough to figure out if the drug helped them live longer or not): 13%, 22%, 23% and 22% of patients from the 4 trials (024, 025, 039 & 040 respectively) had data for evaluation.
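
To see how such numbers come about, here's a toy Perl calculation: the step names echo the pipeline described above, but the per-step yields are pure invention, chosen only to show how modest losses at each stage compound down to roughly the observed range.

use strict;
use warnings;

# invented per-step yields for an illustrative leaky pipeline
my @steps = (
  [ 'consent & paperwork'          => 0.80 ],
  [ 'usable biopsy / purification' => 0.75 ],
  [ 'RNA quality & array QC'       => 0.80 ],
  [ 'evaluable survival data'      => 0.55 ],
);

my $remaining = 1.0;
for my $step (@steps) {
  my ($name, $yield) = @$step;
  $remaining *= $yield;
  printf "%-30s step yield %3.0f%%   cumulative %3.0f%%\n",
         $name, 100 * $yield, 100 * $remaining;
}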

On the other end, many studies were accumulating information that myeloma has many genetic subtypes: perhaps at least seven major ones, and many of these can be further subdivided. For example, one major translocation driving myeloma involves a gene called MMSET. In a subset of these patients, a second gene (FGFR3) is also activated by the translocation. Many other classical clinical measures are used by clinicians, such as albumin and CRP levels. A very interesting question would be whether bortezomib had greater or lesser activity in any of the subtypes (or sub-subtypes); but with the ferocious sample attrition, the sample numbers just aren't large enough to draw conclusions. This also illustrates the power & problem of RNA microarrays: you can look at tens of thousands of genes, allowing you to find patterns with few preconceived biases. But, you are looking at tens of thousands of genes, so the multiple testing problem is very acute.
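
As a back-of-envelope Perl illustration of why the multiple testing problem bites (20,000 genes is just a round number for a whole-genome array, not a figure from this study):

use strict;
use warnings;

my $genes = 20_000;   # round number for a whole-genome array
my $alpha = 0.05;     # nominal single-test significance threshold

printf "hits expected by chance alone at p < %.2f: %d\n", $alpha, $genes * $alpha;
printf "Bonferroni-corrected per-gene threshold: %.1e\n", $alpha / $genes;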

The other thing most frustrating about this study, as in a large number of RNA profiling studies, is that there is no Eureka! moment coming from the data. Gene sets were successfully identified which can predict response or survival, but what do they mean? The hope that RNA profiling would provide the Cliff's Notes to a tumor is a hope rarely realized; instead the tumor reveals a nearly inscrutable scrawl. The study succeeded scientifically, but commercially it was not a contributor.

This will probably be more the norm than the exception in the quest for personalized medicine. Huge investments will need to be made in large clinical studies, many of which won't bear fruit, at least immediately. Combined with other myeloma studies, the Mulligan et al study will enhance our knowledge of myeloma. The execution of the study provides a roadmap for other such studies. New technologies are available which weren't when these studies began. In particular, for cancer one might opt for DNA profiling to map the underlying genetic makeup of the tumor (which is often thoroughly scrambled), rather than RNA. While RNA is where the action really is, DNA is much more stable and therefore may lead to results more consistent between clinical sites. And once in a while, a study might just have results that have oncologists running through the streets, making the whole exercise worthwhile.

Wednesday, March 14, 2007

Cancer Kinases

Last week's Nature had a big paper from the Sanger surveying human protein kinase genes, hunting for somatic mutations in cancer. The paper (and a News&Views item; alas, both require a subscription) has deservedly received a lot of press coverage, but a few notes are in order.

First, it is important to underline that this is a first discovery step which associates these mutations with cancer, but it certainly can't guarantee that they are involved. Tumors generally have very battered genomes; indeed, the study noted more mutations in tumors likely to have undergone extensive mutagenesis (defects in DNA repair, smoking-related lung tumors, melanomas from skin exposure, tumors in patients treated with mutagenic oncologic drugs). A useful filter is to compare the ratio of synonymous to non-synonymous mutations, that is those mutations which do not change the amino acid coded in the message vs. those that do. Synonymous mutations should (to a first approximation) not be selected against, so they can be used to estimate the background mutation rate. If a non-synonymous mutation is seen more often than expected, it is inferred to have been selected as advantageous.
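
A cartoon of that synonymous-as-background logic in Perl, with every number invented (including the ~2.5 ratio of non-synonymous to synonymous sites, a rough illustrative figure rather than one from the paper):

use strict;
use warnings;

my $syn_observed    = 40;    # synonymous changes: assumed ~neutral, so a background estimate
my $nonsyn_observed = 130;   # non-synonymous changes actually seen
my $site_ratio      = 2.5;   # rough (illustrative) ratio of non-synonymous to synonymous sites

my $nonsyn_expected = $syn_observed * $site_ratio;
printf "non-synonymous expected under neutrality: %.0f (observed %d)\n",
       $nonsyn_expected, $nonsyn_observed;
print  "the excess hints at selection, i.e. candidate driver mutations\n"
    if $nonsyn_observed > $nonsyn_expected;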

One interesting side observation is that in some tumors there is an excess (vs. random chance) of mutations at TC / GA dinucleotides (or TpC / GpA as written -- a common convention to specify that this means T followed by C and not any dinucleotide containing T and C). Such a pattern was not observed in germline (normal) samples from the same patients nor has it been observed previously, suggesting a tumor-specific mutational process.

Protein kinases are an obvious set to look at because so many are already known to be implicated in cancer and drugs targeting kinases have already been found useful in the clinic. Indeed, this week the FDA gave the first approval for Tykerb, another small molecule drug targeting oncogenic kinases. The kinases found in the study include many already well implicated in disease. For example, the same group had previously found BRAF mutated in many melanomas. I've discussed STK6 (Aurora A) in this space previously, and STK11 (LKB1) is a well studied tumor suppressor. But there are some interesting surprises. For example, in the list of the top 20 kinases ranked by probability of carrying a cancer driver mutation, there appear to be at least two kinases that are essentially completely uncharacterized in the public literature, MGC42105 & FLJ23074. Another interesting hit is the top one in the list: titin, a protein that hugely deserves its own post. Titin's functions in muscle are well characterized, but a role in cancer would appear to be new. A mutation in KSR2 which resembles kinase activating mutations in other kinases is interesting, as KSR2 has at least sometimes been thought to be an inactive pseudokinase. AURC, whose role in anything remains controversial, shows up with a mutation in a key part of the ATP-binding pocket (P-loop).

There will be a lot of work to actually nail down the role (or lack thereof) of these kinase mutations in cancer. Many other experiments, such as RNAi, have been targeting kinases to try and identify roles in cancer. Most of the mutations observed here were seen in only a few tumors, so there will be lots of work to screen more tumor samples and important cell lines for these mutations. Finally, there are a lot of other genes and gene families (e.g. small GTPases) worth looking at.

However, this is all still very expensive (though the total sequence data, while huge by most standards at 274 Mb of final sequence, pales next to the 6.3 Gbp from Venter's metagenomics cruise). An important question is to what degree research dollars should be invested in these studies vs. other important functional studies (such as RNAi & conditional mouse models, to name just 2). While new sequencing technologies will bring down the costs of large scale sequence scanning, the cost will not go to zero. Balancing the approaches will remain a great challenge for the cancer research community.

Tuesday, March 13, 2007

Sailing the Genomes Blue

Today's Wall Street Journal had an item on Craig Venter's new publication in PLoS Biology describing the collection and metagenomic sequencing of seawater from around the world. You'll need to have paid access to the WSJ, or find a print copy (my access), or perhaps it will show up on a free newspaper site at some point (many WSJ articles do via the wire services). Further information is available on the expedition's website, including pictures of their sailboat Sorcerer II.

The raw numbers are amazing: 6.3 Gbp of raw data -- or about two human genome equivalents -- and all apparently by 'old-fashioned' fluorescent Sanger sequencing. Samples were collected at regular intervals along the sailing route.

There's a lot in the paper, and I won't pretend to have read all of it. One interesting bit is what the authors call 'extreme assembly'. Whereas most genome assembly schemes attempt to minimize the probability of getting chimaeric assemblies (with data glommed together that should be apart), this approach tries to get as big an assembly as possible -- as long as 900 Kb from this dataset. While chimaeras are expected (and found), the hope is that you can untangle the knots later and that these extreme assemblies will be useful in pulling together sequences that belong together.
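
To give a flavor of 'merge everything that overlaps' -- and only a flavor; this toy greedy sketch is emphatically not the paper's algorithm -- here is some Perl that repeatedly glues together the pair of sequences with the longest suffix-prefix overlap. The reads and the minimum overlap are made up.

use strict;
use warnings;

my $min_overlap = 5;
my @seqs = qw(ACGTTACGGTGGT CGGTGGTTTACAC TTACACCCGATT);   # invented reads

# longest suffix of $a that matches a prefix of $b (0 if below $min_overlap)
sub overlap {
  my ($a, $b) = @_;
  my $max = length($a) < length($b) ? length($a) : length($b);
  for (my $len = $max; $len >= $min_overlap; $len--) {
    return $len if substr($a, -$len) eq substr($b, 0, $len);
  }
  return 0;
}

while (@seqs > 1) {
  my ($best_i, $best_j, $best_len) = (0, 0, 0);
  for my $i (0 .. $#seqs) {
    for my $j (0 .. $#seqs) {
      next if $i == $j;
      my $len = overlap($seqs[$i], $seqs[$j]);
      ($best_i, $best_j, $best_len) = ($i, $j, $len) if $len > $best_len;
    }
  }
  last unless $best_len;   # nothing left to merge
  my $merged = $seqs[$best_i] . substr($seqs[$best_j], $best_len);
  @seqs = ($merged, map { $seqs[$_] } grep { $_ != $best_i && $_ != $best_j } 0 .. $#seqs);
}
print "$_\n" for @seqs;    # chimaera-prone, but maximally glued together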

One other nice bit: in addition to deposition at NCBI, the data & tool set will be made freely available at a site called CAMERA. One of my long-held idealistic beliefs about the genome project & bioinformatics is that they can be a great leveler of educational institutions (or more properly, a great boost for many smaller schools). With hardware which is increasingly cheap & ubiquitous, any undergraduate (or high school student!) can do interesting analyses using tools and data which are freely accessible. When I was an undergraduate, our budget for sequencing was about one kit per semester (and these were the pre-ABI days -- we're talking radioactive dideoxy here) -- and with a little bad luck we never got any useful data. I dabbled with public sequence data then -- but how little there was. Now, an undergraduate funded far worse than I was can have an endless supply of explorations.

The WSJ item brought out one interesting incident: at one point Venter and his crew were apparently placed under house arrest in a Pacific island nation (I forget which one; it was in the article). Treaties on bioprospecting give nations the right to regulate such activities in their territorial waters, and Venter apparently didn't have the correct permits. Of course, the seawater bugs are probably rather deficient in critical documents such as passports, nor do I expect they swear allegiance to any nation.

Venter has, of course, obtained the career status many claim to dream of (particularly in the context of mega-lottery winnings): he is independently wealthy & gets to combine his favorite leisure activity with further promotion of his scientific interests. Color me several shades of green.

Monday, March 12, 2007

You say tomato, I say $tomato

When I first started programming thirty or so years ago, my choice of language was simple: machine code or bust. I didn't like machine code much, so I never wrote very much. A pattern, however, was established which would be maintained for a long time. A limited set of computer languages would be available at any one time, and I would pick the one that I liked the best and work solely in that. Machine code gave way to assembler (never mastered) to BASIC to APL to Pascal. Transitions were short and sweet; once a better language was available to me, I switched completely. A few languages (Logo, Forth, Modula 2) were contemplated, but never had the necessary immediate availability to be adopted.
A summer internship tweaked the formula slightly -- at work I would use RS/1, because that's what the system was, but at home I stuck to Pascal. For four years of college this was the pattern.

Grad school was supposed to mean one more shift: to C++. However, soon I discovered the universe of useful UNIX utility languages, and sed and awk and shell scripts started popping up. Eventually I discovered make, which is a very different language. A proprietary GUI language based on C++ came in handy. Prolog didn't quite get a proper trial, but at least I read the book. Finally, I found Perl and tried to focus on that, but the mold had been broken -- and for good measure I wrote one of the world's first interactive genome viewers in Java. My thesis work consisted of an awful mess of all of these.

Come Millennium, I swore I would write nothing but Perl. But soon, that had to be modified as I needed to read and write relational databases, which requires SQL. Ultimately, I wanted to do statistics -- and these days that means R.

There are a number of computer language taxonomies which can be employed. For example, with the exceptions of make, SQL (as I used it), and Prolog, all of these languages are procedural -- you write a series of steps and they are executed. The other three fit more of a pattern of the programmer specifying assertions, conditions or constraints, and the language interpreter or compiler executes commands or returns data according to those specifications.

Within the procedural languages, there is a lot of variation. Some of this represents shared history. For example, C++ is largely an extension of C, so it shares many syntactic features. Perl also borrowed heavily from C, so much is similar. R is also loosely in the C syntax family. All of these languages tend to be terse and heavily use non-alphabetic characters. On the other hand, SQL is intrinsically loquacious.

The fun part is when you are trying to use multiple languages simultaneously, as you must keep straight the differences & properly shift gears. Currently, I'm working semi-daily in Perl, SQL and R, and there is plenty to catch me up if I'm napping. For example, many Perl and R statements can interchange single and double quotes freely -- as long as you do so symmetrically; SQL needs single quotes around strings.
Perl & R use the C-style != for inequality; SQL is the older style <> and in paralled Perl & R use == for equality whereas SQL uses a single = -- and since a single = in Perl is assignment, forgetting this rule can lead to interesting errors! R is a little easier to keep straight, as assignment is <- . R and Perl also diverge on $ -- for Perl it precedes every single value (scalar) variable, whereas in R it specifies a column of a table. I haven't done C++ or Java for over ten years, but my mind still wants to parse an R variable foo.bar as bar is a member of class instance foo (perhaps because that's the SQL idiom as well), but in R the period is just another legal character for composing a name -- and in Perl it's yet another syntax ( ->{'key'} ) to access the members of a class.

While I know all the rules, inevitably there is a mistake a day (or worse an hour!) where my R variables start growing $ and I try to select something out of my SQL using != . Eventually my mind melts down and all I can write is:
select tzu->{'name'},shih$color from $shih,$tzu where shih.dog==tzu.dog

which doesn't work in any language!

Wednesday, March 07, 2007

Eight Ligands A Leaping

There are few things you can appreciate better than something you have striven hard at yet failed to achieve. For a bit of time I was a minor expert in G-protein coupled receptors (GPCRs) -- well, really just the curator of a private database.

GPCRs are molecular wonders. The human genome contains around a thousand or so, but a large fraction of these are olfactory receptors -- our detectors of scents. These are organized into at least three major sequence families -- there were always a few more trying to break in, and I've lost track of the current opinion on these unusual families.

GPCRs have two key characteristics. First, they signal by coupling to heterotrimeric GTP-binding proteins, or G-proteins. Second, they have seven membrane spanning domains. Indeed, the main reason to claim some new looks-like-nothing-else protein as a GPCR was the prediction of this seven transmembrane, or 7TM, character. That 7TM character also makes them crystallographic sinkholes -- I think it is still true that only one crystal structure has been reported (bovine rhodopsin).

GPCRs have an amazing variety of ligands, ranging from small proteins to peptides to sugars to lipids to nucleotides to what have you. As mentioned above, our sense of smell is largely driven by GPCRs -- the discovery of this large subfamily led to a Nobel prize. All sorts of molecules have smells, suggesting the versatility of these proteins. Some fundamental tastes are also detected by GPCRs. Our very entry into this world is governed by a GPCR (oxytocin receptor). Perhaps the most amazing GPCRs are those that detect light and enable our vision. While a photon isn't truly the ligand for these receptors (a photoisomerization product of a covalently bound small molecule is), it is fun to think of it that way. If someday a physiological role is found for a noble gas, I wouldn't want to bet against a GPCR being the receptor for it.

GPCRs are also key drug targets. Many neurotransmitters are detected by GPCRs, along with many important hormones. Because they are such important drug targets, special care was taken by every genomics company in sifting through their data to ensure that no GPCR slipped through unnoticed. Many that were found resembled olfactory receptors and probably are -- though sometimes they are clearly expressed in rather peculiar places outside the nose.

Once a new GPCR is found, life is not easy. In order to configure a high-throughput screen for a small molecule (a few GPCRs are antibody targets, namely the chemokine receptors), you really need to know what the input is and which G-protein the output is sent out on. This also doesn't hurt in deducing the physiological role for the GPCR. The G-protein is the easy side. The specificity is mostly in the alpha subunit, of which there are around 20, though they fall into a few subfamilies. Most GPCRs talk to only one of these subfamilies, and better yet for drug discovery there are mutants which seem to be rather promiscuous. So that's taken care of.

But finding a ligand: good luck! Since GPCRs as a family seem to bind almost anything, a novel one might bind just about anything. Treeing them with their kinfolk can suggest possible ligand classes, as neighborhoods on the tree will often have similar ligands, but that's no help if your novel GPCR doesn't look much like the rest. So every lab would throw a small kitchen sink of candidate ligands at their 'orphan' GPCRs and look for a signal -- and based on our experience & what's in the literature, success wasn't very common. New ligands would appear, often in small cascades -- once a new class of ligand was identified (such as short chain fatty acids), a slew of papers would follow after a bunch of these had been explored on orphan receptors. But the last time I checked my database of receptors of interest without ligands, the list was still long.

One interesting possibility is that some of these receptors don't have specific ligands, because they may not function on their own. Heterodimerization of GPCRs has been reported, and other families of receptors (kinases, nuclear hormone receptors) show how proteins lacking some key receptor functions can still be very important via heterodimerizing with close relatives.

So it is with a bit of envy that I view the recent press release from Compugen, an Israeli company that built an informatics approach to identifying novel transcripts and splice variants. They report finding, and demonstrating the function of, eight novel peptide ligands for GPCRs, some for orphan GPCRs and others as additional ligands for previously characterized ones. This is a challenging problem -- one which I and several more clever people at MLNM beat our heads against -- and clearly Compugen has done well. Part of their identification relied on finding characteristic amino acid motifs recognized by the proteases which process these peptides -- many peptide GPCR ligands are clipped from larger precursors. Often, multiple ligands are encoded by the same precursor. Finding novel precursors is not trivial -- not only are they very short open reading frames, and therefore difficult to distinguish from random open reading frames appearing in DNA, but many are also on fast evolutionary clocks -- which means that finding these peptides by cross-searching the human and mouse (for example) genomes isn't always much help.
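
For flavor, here is a crude Perl sketch of the motif idea: scan a candidate short ORF's protein sequence for pairs of basic residues (lysine/arginine), the classic cleavage signal for the prohormone-processing proteases, and list the peptides that would be released. The sequence is invented, and real pipelines (presumably including Compugen's) rely on far more than a bare regular expression.

use strict;
use warnings;

my $orf = 'MKTLLLAVLFLTASAFAKRSEELNKYAQDFLARRGVDPETLRKMYPGQ';   # invented protein sequence

my @pieces = split /(?:KR|RR|RK|KK)/, $orf;   # cut at dibasic (K/R) sites
my $n = 0;
for my $pep (@pieces) {
  next if length($pep) < 5;                   # skip tiny scraps
  printf "candidate peptide %d: %s\n", ++$n, $pep;
}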

So hats off to Compugen. I would be shocked if we are done finding GPCR ligands, but to find eight at once is quite an achievement.

Monday, March 05, 2007

What's in a title?

Boston has two major daily papers, The Boston Globe and The Boston Herald. The Globe is the more stately broadsheet, whereas the Herald revels in being the sensational tabloid. My tastes tend strongly towards the Globe, though it sometimes seems more like The Boston Glob, but I do browse the Herald -- when I can get it for free. Yes, I'm a bit of a newspaper snob -- though nothing like James D. Watson, who would apparently put down the Herald (and by extension readers of that paper) on a daily basis when he was at Harvard (I got this first hand from his glasswasher -- who read a Herald daily).

While I have no love for the Herald's style & quality of journalism (e.g.: when the Globe fired a populist columnist for plagiarism, the Herald gleefully scooped him up), I do enjoy their screaming headlines. Short, pithy & fun -- though accuracy and fairness clearly aren't strong selection criteria.

The headlines in scientific journals and newswires tend to be long on long and short on punchy. Perhaps some is an urge to cram as many keywords as possible into the title, and perhaps some is a deliberate desire for dryness. While these titles often fit the purpose, it isn't uncommon to be able to rewrite one for more zazz, especially if you are emailing abstracts to a colleague rather than editing a journal.

Of course, one advantage of long and ponderous is a single possible meaning -- spell it out in detail, and nobody can misinterpret it accidentally -- or deliberately. Rarely can a scientific paper title or newsfeed item become a candidate for Jay Leno's headlines schtick, but it does happen. GenomeWeb is usually a good provider of useful news, but the other week I got a grin out of a headline that could be seen as a politically incorrect description of enlisting patients in their own cause:
Sick Kids to Use GenoLogics' Geneus Software in Multi-Lab Stem Cell Research
Of course, the item really refers to The Hospital for Sick Children ('Sick Kids') in Toronto.

Other times, someone does put together a clever headline that grabs the eye -- usually with a clever name for a hypothesis
Retaliatory mafia behavior by a parasitic cowbird favors host acceptance of parasitic eggs
-- now there's a memorable piece of jargon!

However, I do not like titles to mislead:
The calorically restricted ketogenic diet, an effective alternative therapy for malignant brain cancer.

If you skip to the bottom of the abstract, it's even worse:
This preclinical study indicates that restricted KetoCal(R) is a safe and effective diet therapy and should be considered as an alternative therapeutic option for malignant brain cancer.

It's an interesting idea (with precedent in the literature), but 'safe & effective'? The key term left out of the title is 'xenograft mice'. Only proven so if you are a xenografted mouse, a population for which a huge variety of 'cures' already exist. The abstract as a whole isn't bad, but I'll hardly be shocked if I start seeing ads touting the final sentence without qualification.