Omics! Omics!: 2008

Tuesday, December 02, 2008

A few questions for Governor Palin

It's hard to believe that it's been a full month since the historic election. Well, depends on how you count a month, but today is the first Tuesday after the first Monday in December.

I was more of a political junkie in my youth, but I haven't sworn off the habit. Only in the last few days was I attempting to handicap the electoral college. TNG was a huge Obama fan, asking every adult in sight whether they would be voting for him. On the flip side, the other ticket had Miss Amanda quite charged up -- the idea of a Canino-American being one heartbeat from the presidency was too much to resist (though she has declared she will nip any groomer who attempts to apply lipstick to her!). Her disappointment that night was quickly salved by Obama's first major policy declaration in his celebratory speech. Alas, her closest kin have not been mentioned as in the running for the White House staff position.

Speaking of Governor Palin, it seems she will not be fading from the limelight. No, indeed it looks like her personal Iditarod will be going for the nomination in 2012. Alaska's chief executive made a number of comments during the campaign which induced consternation in the scientific community. Granted, the fruit fly remark was specifically about research on a totally different bug than Drosophila in a completely agriculturally-targeted setting, but it didn't endear her to the fans of Morgan & Bridges. Given she has four years to prepare, it wouldn't hurt to start now. And, in the spirit of reuse, should she not run it would seem the majority of these queries would apply to the majority of other Republicans who went for the high office this year.

1) You have publicly taken stands that some views held by a minority (or less) of the scientific community should be accepted and used as the basis for policy decisions (e.g. the existence and/or cause of global warming trends) and/or taught in public schools as viable alternatives to the majority view (e.g. creationism). How do you choose which 'maverick' scientific theories have merit and which do not?

2) Which of the following maverick theories, relevant to major issues in this country today, should be taught in public schools or used to guide policy:

2.1) Healthcare (research priorities, Medicare/Medicaid reimbursement policy)

2.1.1) Childhood vaccines cause autism

2.1.2) AIDS can be treated more effectively with vitamin combinations than antiretrovirals

2.1.3) AIDS is caused by lifestyle factors and not the virus HIV

2.1.4) High cholesterol levels do not cause heart disease; cholesterol lowering using drugs risks cancer & depression

2.2) Physical sciences

2.2.1) Petroleum is not a limited supply of fossil remains of ancient lifeforms but rather is constantly created by processes deep in the earth (clearly an area which Ms. Palin has declared to be in her sphere of expertise)

2.2.2) Manned space travel through the Van Allen belts is guaranteed to be lethal; funding an attempt to land on the moon should be cancelled.

2.2.3) Einstein's Theory of Relativity is clearly wrong, as the concept of time dilation is so opposed to normal experience as to be laughable.

3) Should the U.S. government ever fund research outside its borders? Under what conditions should such operations be funded, if ever?

4) To what degree should non-expert politicians alter the research funding priorities set by experts in the field?

5) What, if any, useful science has come from studying fruit flies? Should the U.S. fund any further research? What other organisms do you also feel are not worth researching?

This is just a draft; readers are invited to submit further questions via the comments.

Saturday, October 18, 2008

Cilantronomics

We had dinner last night in one of our favorite local eateries, a wonderful little Mexican place in a neighboring town. When I sat down, my eye was drawn immediately to my sort of dish -- one with a rich sauce combining the tang of tomatillos with the zing of cilantro. I picked well.

I really do love cilantro. Despite an extensive garden growing up, it was only in my adult life that I encountered this herb. I've been making up ever since. It works well in so many situations, not only in Mexican but also a lot of Asian cooking. The excellent Tibetan buffet in Central Square uses it extensively, particularly in a salad that works equally well before the meal as after, with the bite of cilantro contrasting with sweet cherry tomatoes and mango chunks. I even have a pot of it on my desk, which I share with my neighboring cilantrophiles.

However, not everyone loves cilantro. And it isn't just that some folks might not like that little edge -- no, for some it tastes awful. Rather than some herbal bite, they taste soap. Or weirder. What other herb has its own http://www.ihatecilantro.com/?

Is it genetic? Alas, there has been a dearth of research on cilantro tasting -- indeed, it doesn't seem to rate an OMIM entry. There is a compound called PTC which is known to be untastable by some, including this correspondent, and is genetically linked (my father can taste it; haven't surveyed the rest of the clan). With the help of some research by one of my office neighbors (and fellow cilantro fan), I did learn that 23andMe includes cilantro taste in their questionnaire. It isn't clear whether the other public and private genome projects are tracking this key phenotype.

Okay, I jest a bit. But while the ability to taste cilantro, or PTC, or the host of other innocuous traits which are staples of grade school genetics labs (e.g. widow's peak, hitchhiker's thumb, attached earlobes, etc) aren't exactly critical to understand, they will be interesting to understand. Widow's peak doesn't change someone's life, but to understand it is to understand a bit more about how patterns are laid out. The sciences of smell and taste have advanced tremendously over my lifetime; a whole new taste was found! Identification of smell receptors (recognized by a Nobel) and taste receptors have given great insights -- but we still understand very little.

Are there practical applications for smell & taste research? Of course. But to me the most interesting part is to figure out how it works. PTC doesn't seem so complicated, as the test paper doesn't have any flavor other than paper. But cilantro seems like a much more complicated, and interesting, question. Why does it taste bad rather than just not taste?

Is there an underlying soapiness which I just don't taste? In this case, tasters have a receptor for the magic compound (which is what?) and non-tasters simply lack it. Or does a different receptor bind the compound in tasters, in which case they have a gain-of-function mutation? Or, perhaps they have a partial loss of function -- there are a number of known compounds with concentration-dependent odor, probably due to differential binding to different receptors. In other words, at low concentrations these compounds bind to high-affinity receptors (yielding one perception) and at high concentrations to some additional one (yielding another). Or, perhaps a partial gain of function in the non-tasters -- the same model could apply.

No, I wouldn't recommend basing an R01 application on the science of cilantro taste. Nor is it likely to tease a few million from some VCs as the core of a business plan. Cilantro haters will probably never have the option of genetic therapy to alter their perception. But it is still an interesting scientific question, and I look forward to personal genomics shedding some light on it.

Wednesday, October 15, 2008

The Blue Bus Grows Up

A striking characteristic of the Cambridge biotech scene is how it is concentrated in an urban setting. While there are a lot of biotechs elsewhere in Massachusetts, the Hub's hub is clearly a 2+ mile long zone. One challenge this offers is getting to work via Boston's transportation system.

Boston doesn't have an awful transportation network, but it isn't golden either. The transit system is decent, but the routes still largely follow a radial design, with routes that have changed little in the last half century (I kid not; I've seen a map that old & it takes a careful eye to find the differences). An extensive network of commuter rail feeds the downtown, but is split between two termini separated by a mile. The highway network has a number of gaping gaps, due to a mass cancellation of uncompleted highways in the early 1970's. However, this wasn't necessarily bad for biotech; I've spent half my career in offices that would literally be in the middle of the road should those highways have been built (for example, this very different vision for 640 Memorial Drive than a genomics-based pharmaceutical company).

The commuter rail option presents a particular challenge. One station, South Station, is connected to the Red Line subway which has two stops (Kendall & Central) proximal to many biotechs. The other station, North Station, has terrible connections to Cambridge, other than the perhaps future expansion of the zone into the Cambridge-Charlestown-Somerville interzone. But, if you live north of town it's either deal with getting from North Station to Cambridge or brave I-93. So, by multiple subway connections or a tortuous pedestrian path through Mass General's campus to the Red Line, a not tiny cohort of biotechies has made their commute this way.

Then about five years ago a new option appeared: little blue buses promising a single seat ride from North Station to Cambridge, with a twisting route designed to pass near nearly every major employer in the zone. The service is called EZRide; anyone can ride for $1, but better yet the larger employers offer ride-all-year stickers.

The service was a bit slow starting up & went through a few hiccups, but over time it has been impressive. If memory serves, the initial interval between buses was 30 minutes; this has been steadily dropped so that now a bus shows up every 8 minutes during commute time. However, demand has grown even faster; during core commuting times the bus is at its legal limit, with only ~30 seats and fewer than a dozen legal standees.

But a couple of weeks ago the announcement came out: a new vendor would be running full size buses on the route. And last week they showed up. On the one hand, there are more seats -- but not as many as one might think due to the layout. On the other, there are more legal standees and less of the EZRide shuffle -- having to exit the bus at the early stops to let people off, because if you were standing you were a cork in the aisle of the old buses. The longer buses don't handle the tight turns as well but do away with the most unpleasant aspect of the old buses: their short wheelbase combined with Cambridge's potholes yielded an amusement-park quality bumpy ride (particularly unpleasant if you made the mistake of leaning against the wheelchair lift).

EZRide isn't run by the transit system, but rather by a quasi-public entity called CRTMA which is charged with improving transit into Cambridge. The director, Jim Gascoigne, is energetic and personable and often on the scene, particularly when weather or accident snarls require emergency re-routing.

In some ways the EZRide highlights issues in Boston. The T does an okay job, but it apparently never occurred to them in several decades that a market existed for a route from North Station to Cambridge. I've also seen the truly surreal quality of Boston from the blue bus: due to some construction, one Boston police officer directed the bus to stop in a new location, where the driver was promptly berated & ticketed by a second Boston officer. This is the town where the mayor had major apoplexy when a nearby airport added Boston to their name; transportation issues are about turf battles as much as moving people around. The planners also have a fondness for expensive megaprojects (the current shopping list can be found here). Several of these would have important benefits for the biotech zone -- for example, the proposed Urban Ring would run right through it & connect the zone to the Longwood Medical Area.

However, perhaps what is more realistic are more EZRide-like services, perhaps connecting to the south (Brookline, Brighton/Allston) that have surprisingly poor connections, or to the large transit hubs to the north (Wellington, Anderson). A direct connection to Charlestown wouldn't be a bad concept either.

In the meantime, I'll keep riding EZRide. And anxiously awaiting my train line(s) getting the free WiFi service a few lucky commuters have gotten to pilot. One more good reason to stay off the road and out of my car!

Tuesday, October 14, 2008

Panda genome arrives

China announced over the weekend the completion of the giant panda genome.

For the benefit of presidential candidates who can't conceive of the value of scientific research on bears I'll suggest a few questions worth exploring in the panda genome (beyond the obvious direction of weapons development)

First, the panda genome is one more mammalian genome to add to the zoo. For comparative purposes you can never have too many. Since other carnivore genomes are done (first & foremost the dog, but cat as well), this is an important step towards understanding genome evolution within this important group. It is the first bear genome, but with the price of sequencing falling it is likely that the other bears will not be in the extremely distant future (with the possible exception of Ursa theodoris).

Second, completion of a genome gives a rich resource of potential genetic variants. In the case of an endangered wildlife species such as panda, these will be useful for developing denser genetic maps which can be used to better understand the wild population structure and the gene flow within that structure. Again, if you are running for president please read this carefully: this has nothing to do with paternity suits. If you want to manage wildlife intelligently and make intelligent decisions about the state of a species, you want to know this information.

Third, pandas have many quirks. That bambooitarian diet for starters. Since they once were carnivores, it is likely that their digestive systems haven't fully adapted to the bamboo lifestyle. Comparisons with other carnivores and with herbivores may reveal digestive tract genes at various steps in the route from meat-eater to plant-eater.

Fourth, as the press release points out, there are many questions critical to preserving the species which (with a lot of luck) the genome sequence may give clues to. First among these: why is panda fertility so low? U.S. zoos have been doing amazingly well in this century, but that's only 4 breeding pairs. The Chinese zoos have many more pandas & many more babies, but it's going to take a lot more to save the species.

Thursday, September 18, 2008

Great Galloping Gerbils!

An item on CNN mentioned that satellite technology will be employed to monitor the endangered California Kangaroo Rat. This reminded me of a Nature paper this summer I meant to mention, because the image blew me away (plus it's the first time I've seen Google Earth used as a source for a scientific paper!).

The paper is about models of disease spread (these gerbils are reservoirs for plague), but the thing which jumped out was the Google Earth image; the two images above are from around the same region of Kazakhstan. The gerbils clear vegetation from around their burrows, and these burrows are in huge complexes. The more zoomed in image above is several kilometers wide and yet is packed with gerbil burrows. If you have Google Earth and look around 44.766991 76.449699 you can zoom way out and still see the gerbil complexes. I saw some huge prairie dog towns out west when I was a boy, but nothing on this scale!

How many animals leave traces which can be seen from 30+Km up (the image quality is uneven for this region of the world in Google Earth -- clearly shots are merged from different seasons and resolutions, but 30Km is a conservative estimate)? Human activity obviously. When I think of animal-built structures I generally jump to beaver dams or termite mounds, but they aren't nearly this extensive.

Monday, August 25, 2008

The Joys of DIY Dynamic Programming

For nearly the first decade of my bioinformatics career I carried around a dirty little secret -- well, at least at times I felt it was one. I had coded many things, I could explain many algorithms, but I had never coded a dynamic programming alignment algorithm -- the core to so much I did. I had slightly hacked one version (just to have it do an all-all comparison of a database, doing each possible pairing only once). Finally, for a bunch of reasons, I sat down and did it -- my very own Smith-Waterman implementation.

I'm reminded of this because a couple of weeks ago I rolled back my sleeves and knocked one out again. Now, just the fact I did this reveals a bit about me. I did find at least two freely available C# implementations on-line (e.g. the C# version of JAlign) and there is a plethora of C implementations. There is also Ewan Birney's magnificent Dynamite, pretty much the catch-all for the field (Dynamite is a programming kit for doing this; in effect a programming language for dynamic programming). But, partly as a point of pride & partly because I saw I'd need to hack the one C# copy I looked at in detail, I did it. I even wrote a schmancy version -- a simple cDNA to genomic sequence aligner with two classes of gaps (one being an intron, with a really trivial model of a splice junction -- I think it used dinucleotides). All coded in Perl -- no speed demon, but it solved the problem where we needed it.

Now, it took me a good few hours to do it -- better than the few days of the first time, but not instant. I can claim that this time I didn't fall back on any study aids, such as the many online descriptions or Eddy, Durbin & co.'s very well written book.

The implementation says a lot about me too. I thought of many ways to code it and finally settled on one. For example, there is the question of how to represent the alignment matrix; I used a two-dimensional array scheme (actually implemented using dictionaries -- a holdover from my Perl-centric days) but I could have also made it a graph of nodes. There is also the actual thrashing through the matrix -- the algorithm is inherently recursive, but following familial idiosyncrasies I wrote the code to use loops -- well, actually I completely waffled and implemented it so it can use recursion, but actually loops through! The applications I'm considering are going to be short alignments, so I didn't worry about memory efficiency (who wants to bet that will bite me back!) nor did I fixate on speed (care to double the bet?) -- indeed, I wrote it to allow all sorts of baroque variations, such as different penalties for opening gaps in the two different sequences & for basic profile-to-sequence alignments. Plus it is either Smith-Waterman (local) or Needleman-Wunsch-Sellers (global), with a simple toggle.
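For anyone tempted to try, here is a minimal sketch of the loop-based flavor with a local/global toggle. It is not my C# code: it is a stripped-down Python illustration that assumes a single linear gap penalty (no separate gap-open cost, no profiles), and the function name `align` and the default scores are made up for the example.

```python
def align(a, b, local=True, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for sequences a and b.

    local=True gives Smith-Waterman style local alignment;
    local=False gives Needleman-Wunsch style global alignment.
    Gaps use one linear penalty per gapped position (an assumption).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps along the matrix edges.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(score[i - 1][j - 1] + sub(i, j),  # align two residues
                       score[i - 1][j] + gap,            # gap in b
                       score[i][j - 1] + gap)            # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Traceback: start at the best cell (local) or the far corner (global).
    i, j = best_pos if local else (len(a), len(b))
    final = score[i][j]
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `align("ACGT", "AGT", local=False)` aligns the full sequences with one gap, while `align("TTTACGTTT", "GGACGGG")` pulls out just the shared `ACG`. The full matrix is kept in memory, which is fine for short sequences and exactly the sort of choice that will bite back on long ones.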

So now the pitch: If you are a bioinformatics programmer & you haven't written one, I urge you to do it. It's great practice & nothing illustrates an algorithm like trying to implement it. If you don't consider yourself a programmer, guess what? It's perhaps not the obviously easy first start, but just thinking about it will stretch your mind. Plus, you get a free bioinformatics Rorschach test from your implementation choices!

One last thought: who can think up (and execute) the most comically baroque -- but functional -- implementation of S-W/NWS? Has it already been done in PostScript? How about in a relational database (I've written some pretty baroque SQL this year, but I doubt I could tackle this)? S-W as an Excel spreadsheet? Coded with glider guns? A full description for a true Turing machine? Of course, the grand prize winner would clearly be to build a DNA computer to compute an alignment -- but perhaps that could even be topped by implementing the algorithm with living cells as the alignment cells!

Sunday, August 24, 2008

One more Olympic thought

One other item that was in the mental draft of yesterday's Olympic pondering, but was inadvertently dropped. Another possible genetically-driven edge in athletic performance would not be directly on performance but on the reaction to performance. Prime athletes might have different pain or endorphin responses, less post-exercise inflammation, different injury responses. Some of these might be specific to specific events or types of sports -- joint pounding running or gymnastics puts very different stresses on the body than something like swimming or speedskating.

Saturday, August 23, 2008

An Olympic Pondering Decathlon

Okay, my biennial stint of Olympic watching is about to conclude. A bunch of speculations suggested by this year's stretch, starting with the utterly unscientific and ending with more genomic oriented queries.

1) Having now watched two Olympics using a Digital Video Recorder, it's completely clear that having a few fast forward speeds is no way to navigate multi-hour recordings to find what you want. Surely there are better UIs for this! The thumbwheel on an iPod is one obvious choice, but there must be other ways.

2) The Summer games are blessed with multiple events which touch on multiple disciplines: the decathlon, heptathlon, modern pentathlon & triathlon. Why isn't there a true multi-discipline Winter sport? Biathlon is a glorious combination of two diametrically opposed skills -- racing and precision shooting -- and there is also the Nordic combined, but neither of these samples a wide range. How about this for a Winter hexathlon:
A) 500m long-track speedskating
B) 2500m long-track speedskating
C) Downhill skiing
D) Slalom
E) 10Km X-C skiing
F) ski jump (small hill)

3) I once contemplated attending a HUPO meeting in Beijing; atop an interesting program there was a post-conference trip option to tour the country. There were two issues: I'd have to foot my own travel expenses & I'd be in serious hot water at home for visiting the Wolong panda center solo. But now I'd definitely go; if they replaced a typical dry scientific kickoff with another Zhang Yimou spectacular, I wouldn't hesitate!

4) As a kid I did occasionally have Olympic daydreams. I'm a bit over the hill now, but between athletes older than I such as Dara Torres and hearing that some countries will field just about anyone, perhaps I gave up too soon. If I could pick anything, it would be long track speedskating, the most graceful speed sport bar none. But, more realistically perhaps I could go for the 1500 meter freestyle swimming -- I'd estimate my time is off by only a factor of 3 -- perhaps with some regular training I can get that down to 2!

5) Of course, even with some extensive training I'd look pretty odd on the blocks -- I'm 5'8" and from what I can tell in the TV coverage, it's a rare swimmer who isn't a few inches over 6'. Clearly there are advantages to height in a large number of sports -- but I'm also clearly too tall for a shot at women's gymnastics (atop the other obvious issue). It would be interesting to see which sports have the highest and lowest dispersion in athlete height -- and what those patterns look like. What sport should I have chosen based only on my height?

6) During the Olympics, the world's tallest living woman died, after a life with many health difficulties. Clearly, there are limits to the advantages to height. What sport has the tallest athletes?

7) What more subtle anatomic characteristics might lead to athletic advantage? Differences in muscle fiber composition are an oft-cited one. A TV profile of superswimmer Michael Phelps claimed he is 'double jointed'. But what else? For example, are there subtle differences in some individuals' lungs which lead to more efficient air exchange? Smoother surfaces on bone joints?

8) Diving lower, are there biochemical differences? Again, could there be differences in oxygen transfer or usage? Differences in energy metabolism?

9) If we did genome screens of the athletes, what SNPs would we find over-represented? How many of those would be 'obvious' and how many would lead to new genes which influence performance? Already there is at least one company offering genome scans to predict what sport you should stuff (er, steer) your kid into.

10) The sad story of Flo Hyman illustrates another aspect of selection for unusual body types: she died of an aortic dissection due to Marfan's syndrome, which probably also led to her tall, thin stature which was an advantage on the volleyball court. What other genetic variants have a dicey risk/reward trade-off in the athletic arena? And how many of these are a serious medical issue for regular folks?

Friday, August 22, 2008

Any leads on when pandas join the genome club?

Tonight's bedtime conversation veered all over the map (par for the course), but at one point touched on the announced Chinese effort to sequence the genome of the giant panda. Of course, a key concern was that this did not involve any pain or injury to the beloved bicolors, so the concept of buccal swabbing was introduced.

This project was announced last spring. I thought it was planned to be released during the Beijing Olympics, but that is apparently an invention of my imagination.

So, anyone out there in the know care to hint or leak? When will the first ursid genome arrive? And who was the lucky bear?

Wednesday, August 13, 2008

Larry Ellison, please join the 21st century!

I've sniped at Microsoft at least once, so in the interest of balance I'll take a crack at the other software giant I rely on but also frequently complain about: Oracle.

Oracle is truly amazing. Now, I don't have much experience with other relational database systems, so this isn't comparative. But relational databases are amazing. I give it a query of what I want and if I cross all my t's and dot all my i's, then huge databases are searched rapidly (often a matter of seconds).

My first complaint is with inconsistency in syntax. Oracle has several flavors of text types depending on how big you might let your text get. I mostly query databases, not create them, and so I generally want to treat them all the same. Now there might be some good reason I need to use a different function to get the substring from each type, but I really don't want that hassle. But if I'm stuck with it, why couldn't you keep the argument orders the same? Standard substring, like every substring method I've ever met, has the order: string, start, length. But for the really big text columns ("CLOB"s), it's string, length, start. WHY???
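One way I could paper over this on the client side is a tiny helper that always takes (start, length) and emits the right call. This is only a sketch: it assumes the two functions in question are Oracle's `SUBSTR(string, position, length)` and `DBMS_LOB.SUBSTR(lob, amount, offset)`, and the helper name `substr_expr` and its parameters are invented for illustration. It only builds SQL text and does no quoting or validation.

```python
def substr_expr(column, start, length, is_clob=False):
    """Build a substring expression with one consistent argument order.

    Callers always pass (start, length); the CLOB form swaps them
    because DBMS_LOB.SUBSTR takes (lob, amount, offset).
    """
    if is_clob:
        return f"DBMS_LOB.SUBSTR({column}, {length}, {start})"
    return f"SUBSTR({column}, {start}, {length})"
```

So `substr_expr("notes", 5, 10)` and `substr_expr("notes", 5, 10, is_clob=True)` ask for the same ten characters starting at position five, and the swap happens in exactly one place.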

But worse, when I'm having trouble dotting those i's and crossing the t's, Oracle really doesn't give much help. The error messages are somewhere out of the 60's.

For example, one handy feature of Perl (and other environments) is some attempt to identify common pitfalls and give hints about them in the error messages. A common mistake for me is to include an extra , in my query

select x,y,z,
from mytable

In this example, of course, it's small -- but many of my queries are 30-40 lines long. Surely it could detect that the unrecognized field name is a reserved word and therefore hint that I've included an extra comma.
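The sort of hint I have in mind doesn't take much. Here is a rough sketch of a pre-flight check I could run on a query before sending it off; the function name `extra_comma_hint` and the keyword list are my own inventions, and a plain regular expression like this will be fooled by commas inside string literals or comments.

```python
import re

# Clause keywords that should never directly follow a comma in a select list.
_CLAUSE = re.compile(r",\s*(from|where|group|order|having)\b", re.IGNORECASE)

def extra_comma_hint(sql):
    """Return a hint string if a comma directly precedes a clause keyword."""
    m = _CLAUSE.search(sql)
    if m is None:
        return None
    # Report the line holding the stray comma (1-based).
    line_no = sql.count("\n", 0, m.start()) + 1
    return f"line {line_no}: comma before '{m.group(1).upper()}'; extra comma?"
```

On the two-line query above this returns a hint pointing at line 1, and it returns `None` once the trailing comma is removed.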

Another example. For one query I have I've been parsing out a numeric string and then trying to convert it to a number. Alas, somehow my parse is failing and I'm getting some unconvertible strings back. Oracle gives me an error that it can't convert something to a number -- but keeps that something a secret from me!

I could go on-and-on. Line numbers for the error are frequently non-helpful, the error messages don't give the context of the offending bit, etc, etc.

The one thing I haven't tried is to edit my queries in Visual Studio, which has an SQL mode. I really should try that -- not that VS's error messages are always golden, but it is good about highlighting the likely neighborhood of mistakes in a way SQL Developer (the Oracle interface I use) just doesn't even attempt.

Ah well, I'll live. Larry probably has bigger fish to fry. Personally, though, if it was my software I'd be cringing.

Thursday, July 31, 2008

Farewell to some bits of olde Cambridge

We sent off one of our departing colleagues in style yesterday, taking him to the finest cuisine in Cambridge: the MIT Food Trucks. These institutions are various privately run trucks serving hot foods, from around the world, to long lines of students. While private, the trucks are sanctioned: not only do they have specially reserved parking spots but they also are listed in the MIT Food Service website.

However, first we had to find them. Their previous locale is now a major hole in the ground, to be filled in with the new Koch Cancer Institute (or some such name). With the MIT web site's help, we were able to find the new location.

During the year I (and others) have discovered two other institutions which were not so lucky.

It was a bit of a shock one day to discover the Quantum Books location cleared out, though not so much of one on reflection. Quantum was a bookstore specializing in technical books -- particularly computing books. It was a handy place to browse such books before investing; I've spent far too much on books that looked good but were awful. The shock faded on thinking about it: not only was Quantum getting hammered by the usual Amazon internet tide, but they were in a perfectly awful location. While there might be a lot of commuter traffic, otherwise they were in a nearly retail-free zone that is one of the many crimes against urban design inflicted on Kendall Square in the 60's/70's. They tried to have a children's section and other experiments, but it was hard to see much hope of success. Quantum isn't kaput; it has gone to a nearly totally Internet model, but unless their fans are super-loyal, it's hard to see that lasting long.

Cambridge was once a center of conventional industry. For example, a huge fraction (I forget the amount; it's on a plaque in the park on Sidney Street) of the undersea telegraph cable used in WW2 was created in Cambridge. But fewer and fewer such businesses remain. Even in my short tenure at least 2 candy factories have closed, leaving only one (Tootsie Rolls!). A prominent paint company moved out a few years ago. Sometimes it's hard to tell what's still active & what is only an empty shell. But not in this case.

There will be no more "goo goo g'joob" in Cambridge; Siegal egg company has not only cleared out but been cleared out -- the building is gone. A distributor of eggs, they were across the street from one MLNM building and adjacent to Alkermes. Indeed, it was that proximity to MLNM that forced me to notice them: their egg trucks would sometimes block Albany Street while backing into the loading dock, trapping the MLNM shuttle van (always with me late for a meeting!). I think the demolition was part of the adjacent MIT dorm construction, but perhaps a new biotech building will go in. By chance, the Google street view catches the building being prepared for demolition.

Will some future writer remark wistfully on the disappearance of biotech buildings from Cambridge? It's difficult to imagine -- but who a century ago could have imagined Cambridge getting out of the business of supplying everyday things?

Wednesday, July 30, 2008

Paring pair frequencies pares virus aggressiveness

Okay, a bit late with this as it came out in Science about a month ago, but it's a cool paper & illustrates a number of issues I've dealt with at my current shop. Also, in the small world department one of the co-authors was my eldest brother's roommate at one point.

The genetic code uses 64 codons to code for 21 different symbols -- 20 amino acids plus stop. Early on this was recognized as implying that either (a) some codons are simply not used or (b) many symbols have multiple, synonymous codons, which turns out to be the case (except in a few species, such as Micrococcus luteus, which have lost the ability to translate certain codons).
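The arithmetic is easy to check. The sketch below builds the standard genetic code from the usual compact layout (bases in T, C, A, G order, `*` for stop) and counts how many codons each symbol gets; the variable names are just for this example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid (or '*' for stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons per symbol.
degeneracy = Counter(CODON_TABLE.values())

for symbol, n in sorted(degeneracy.items(), key=lambda kv: (-kv[1], kv[0])):
    print(symbol, n)
```

Leucine, serine and arginine each get six codons, while methionine and tryptophan get only one apiece.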

Early on in the sequencing era (certainly before I jumped in) it was noted that not all synonymous codons are used equally. These patterns, or codon bias, were specific to specific taxa. The codon usage of Escherichia is different from that of Streptomyces. Furthermore, it was noted that there is a signal in pairs of successive codons; that is, the frequency of a given codon pair is often not simply the product of the two codons' individual frequencies. This was (and is) one of the key signals which gene finding programs use to hunt for coding regions in novel DNA sequences.
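To make the codon pair signal concrete, here is a minimal sketch in Python that compares how often each pair of neighbouring codons is observed against what the individual codon frequencies would predict if neighbours were independent. The function names are mine, and this naive ratio is only an illustration; I am not claiming it is the statistic any particular gene finder (or the Science paper discussed below) actually uses.

```python
from collections import Counter


def codons(seq):
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_pair_ratios(seqs):
    """Return {(codon1, codon2): observed / expected} over a list of coding sequences.

    'Expected' assumes neighbouring codons are independent:
    freq(codon1) * freq(codon2) * total number of observed pairs.
    """
    single = Counter()
    pairs = Counter()
    for seq in seqs:
        cs = codons(seq)
        single.update(cs)
        pairs.update(zip(cs, cs[1:]))  # successive codons within one sequence
    n_single = sum(single.values())
    n_pairs = sum(pairs.values())
    ratios = {}
    for (a, b), observed in pairs.items():
        expected = (single[a] / n_single) * (single[b] / n_single) * n_pairs
        ratios[(a, b)] = observed / expected
    return ratios
```

A ratio well above 1 means the pair turns up more often than its parts would suggest, and well below 1 means it is avoided. On real data you would want many genes and some care about small counts.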

Codon bias can be mild or it can be severe. Earlier this year I found myself staring at a starkly simple codon usage pattern: C or G in the 3rd position. In many cases the C+G codons for an amino acid had >95% of the usage. For both building & sequencing genes this has a nasty side-effect: the genes are very GC rich, which is not good (higher melting temp, all sorts of secondary structure options, etc).

Another key discovery is that codon usage often correlates with protein abundance; the most abundant proteins show the greatest hewing to the species-specific codon bias pattern. It further turned out that the tRNAs matching highly used codons tend to be the most abundant in the cell, suggesting that frequent codons optimize expression. Furthermore, it could be shown that in many cases rare codons could interfere with translation. Hence, if you take a gene from organism X and try to express it in E. coli, it would frequently translate poorly unless you recoded the rare codons out of it. Alternatively, expressing additional copies of the tRNAs matching rare codons could also boost expression.

Now, in the highly competitive world of gene synthesis this was (and is) viewed as a selling point: building a gene is better than copying it as it can be optimized for expression. Various algorithms for optimization exist. For example, one company optimizes for dicodons. Many favor the most common codons and use the remainder only to avoid undesired sequences. Locally we use codons with a probability proportional to their usage (after zeroing out the 'rare' codons). Which algorithm is best? Of course, I'm not impartial, but the real truth is there isn't any systematic comparison out there, nor is there likely to be one given the difficulty of doing the experiment well and the lack of excitement in the subject.
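The "probability proportional to usage, after zeroing out the rare codons" idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the usage numbers are made up, the table covers only three amino acids, and the 10% cutoff for "rare" is arbitrary. It is not our production code.

```python
import random

# Hypothetical usage table: amino acid -> {codon: relative usage}.
# The numbers are illustrative only, not measured values for any organism.
USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "R": {"CGT": 0.36, "CGC": 0.36, "CGG": 0.11,
          "CGA": 0.07, "AGA": 0.07, "AGG": 0.03},
}


def weights_without_rare(usage, threshold=0.10):
    """Drop codons whose usage is below threshold; keep the rest with usage as weight."""
    kept = {codon: freq for codon, freq in usage.items() if freq >= threshold}
    if not kept:
        # Never leave an amino acid with no codon: fall back to the most used one.
        best = max(usage, key=usage.get)
        kept = {best: usage[best]}
    return kept


def back_translate(protein, table=USAGE, threshold=0.10, rng=None):
    """Pick one codon per residue, with probability proportional to its usage."""
    rng = rng or random.Random()
    out = []
    for aa in protein:
        kept = weights_without_rare(table[aa], threshold)
        # random.choices does not need the weights to sum to 1
        codon = rng.choices(list(kept), weights=list(kept.values()), k=1)[0]
        out.append(codon)
    return "".join(out)
```

Calling `back_translate("MKR")` repeatedly gives different DNA for the same protein, which is the point: common codons dominate, but the sequence is not a monotonous run of the single favorite. A real design would add a pass to reject undesired sequences.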

Besides the rarity of codons affecting translation levels, how else might synonymous codons not be synonymous? The most obvious is that synonymous codons may sometimes have other signals layered on them -- that 'free' nucleotide may be fixed for some other reason. A more striking example, oft postulated but difficult to prove, is that rare codons (especially clusters of them) may be important for slowing the ribosome down and giving the protein a chance to fold. In one striking example, changing a synonymous codon can change the substrate specificity of a protein.

What came out in Science is using codon rewriting, enabled by synthetic biology, on a grand scale. Live virus vaccines are just that: live, but attenuated, versions of the real thing. They have a number of advantages (such as being able to jump from one vaccinated person to an unvaccinated one), but the catch is that attenuation is due to a small number of mutations. Should these mutations revert, pathogenicity is restored. So, if there was a way to make a large number of mutations of small effect in a virus, then the probability of reversion would be low but the sum of all those small changes would be attenuation of the virus. And that's what the Science authors have done.

Taking poliovirus they have recoded the protein coding regions to emphasize rare (in human) codon pairs (PV-Min). They did this while preserving certain other known key features, such as secondary structures and overall folding energy. A second mutant was made that emphasized very common codon pairs (PV-Max). In both cases, more than 500 synonymous mutations were made relative to wild polio. Two further viruses were built by subcloning pieces of the synthetic viruses into a wildtype background.

Did this really do anything? Well, their PV-Max had similar in vitro characteristics to wild virus, whereas PV-Min was quite docile, failing to make plaques or kill cells. Indeed, it couldn't be cultured in cells.

The part-Min, part-wt chimaeras also showed severe defects and some also couldn't be propagated as viruses. However, one containing two segments of engineered low-frequency codon pairs, called PV-MinXY, could be propagated but was greatly attenuated. While its ability to make virions was slightly attenuated (perhaps one tenth the number), more strikingly about 100X the number of virions was required for a successful infection. Repeated passaging of PV-MinXY and another chimaera failed to alter the infectivity of the viruses; the strategy of stable attenuation through a plethora of small mutations appears to work.

When my company was trying to sell customers on the value of codon optimization, one frustration for me as a scientist was the paucity of really good studies showing how big an effect it could have. Most studies in the field are poorly done with too few controls and only a protein or two. Clearly there is a signal, but it was always hard to really say "yes, it can have huge effects". Clearly in this study of codon optimization writ large, codon choice has enormous effects.

Tuesday, July 29, 2008

The youngest DNA author?

Earlier this year an interesting opportunity presented itself at the DNA foundry where I am employed. For an internal project we needed to design 4 stuffers. Stuffers are the stuff of creative opportunity!

A stuffer is a segment of DNA whose only purpose is to take up space. Most commonly, some sort of vector is to be prepared by digesting with two restriction enzymes and the correct piece then purified by gel electrophoretic separation and then manual cutting from the gel. If you really need a double digestion then the stuffer is important so that single digestion products are resolvable from the desired product; the size of the stuffer causes single digests to run at a discernibly different position.

Now, we could have made all 4 stuffers nearly the same, but there wasn't any significant cost advantage and where's the fun in that? We did need to make sure this particular stuffer contained stop codons guarding its frontiers (to prevent any expression of or through the stuffer), that it possess the key restriction sites and that it lack a host of other sites possibly used in manipulating the vector. It also needed to be easily synthesizable and verified by Sanger sequencing -- no runs of 100 As for example. But beyond that, it really didn't matter what went in.

So I whipped together some code to translate short messages written in the amino acid code (obeying the restriction site constraints) and wrap that message into the scaffold framework. And I started cooking up messages or words to embed. One stuffer contains a fragment of my post last year which obeyed the amino acid code (the first blog-in-DNA?); another celebrates the "Dark Lady of DNA". Yet another has the beginning of the Gettysburg Address, with 'illegal' letters just dropped. Some other candidates were considered and parked for future use: the opening phrase to a great work of literature ("That Sam I am, That Sam I am" -- the title also works!), a paean to my wagging companion.
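For the curious, a toy version of that kind of message encoder might look like the Python below. It is a sketch under loud assumptions, not the code I actually used: the one-codon-per-letter table is an arbitrary pick among synonymous codons, the forbidden site list (EcoRI and BamHI) is a hypothetical stand-in for the real exclusion list, and only a single stop codon guards each end rather than stops in every frame.

```python
# One codon per amino acid letter; an arbitrary choice among the synonymous codons.
CODON = {
    "A": "GCT", "C": "TGC", "D": "GAT", "E": "GAA", "F": "TTC",
    "G": "GGT", "H": "CAT", "I": "ATC", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "TCT", "T": "ACC", "V": "GTT", "W": "TGG", "Y": "TAC",
}

STOP = "TAA"

# Hypothetical exclusion list: EcoRI and BamHI recognition sites.
# Both are palindromic, so checking one strand is enough for these two.
FORBIDDEN_SITES = ["GAATTC", "GGATCC"]


def encode_message(message):
    """Translate a message into DNA, silently dropping 'illegal' letters.

    Letters with no standard amino acid (B, J, O, U, X, Z), spaces and
    punctuation are simply skipped.
    """
    letters = [ch for ch in message.upper() if ch in CODON]
    return "".join(CODON[ch] for ch in letters)


def build_stuffer(message):
    """Wrap the encoded message in stop codons and reject forbidden sites."""
    dna = STOP + encode_message(message) + STOP
    hits = [site for site in FORBIDDEN_SITES if site in dna]
    if hits:
        raise ValueError(f"stuffer contains forbidden site(s): {hits}")
    return dna
```

With this table "four score" loses its O and U and becomes the peptide FRSCRE. A message containing "EF" is rejected outright, since GAA followed by TTC spells an EcoRI site; a less lazy encoder would try a different synonymous codon instead of giving up.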

But the real excitement came when I realized I could subcontract the work out. My code did all the hard work, and another layer of code by someone else would apply another round of checks. The stuffer would never leave the lab, so there was no real safety concern. So I offered the challenge to The Next Generation and he accepted.

He quickly adapted to the 'drop illegal letters' strategy and wrote his own short ode to his favorite cartoon character, a certain glum tentacled cashier. I would have let him do more, but creative writing's not really his preferred activity & the novelty wore off. But, his one design was captured and was soon spun into oligonucleotides, which were in turn woven into the final construct.

So, at the tender age of 8 and a few months the fruit of my chromosomes has inscribed a message in nucleotides. For a moment, I will claim he is the youngest to do so. Of course, making such a claim publicly is a sure recipe for destroying it, as either someone will come forward with a tale of their toddler flipping the valves on their DNA synthesizer or will just be inspired to have their offspring design a genome (we didn't have the budget!).

And yes, at some future date we'll sit down and discuss the ethics of the whole affair. How his father made sure that his DNA would be inert (my son, the pseudogene engineer!) and what would need to be considered if this DNA were to be contemplated for environmental release. We might even get into even stickier topics, such as the propriety of wheedling your child to provide free consulting work!

Monday, July 28, 2008

The challenge of promulgating bear facts.

I had an opportunity this evening to briefly review the impact of DNA research on the taxonomy and conservation of Ailuropoda melanoleuca, which also made me reflect on the frustrating struggle of scientific fact to reach the public forefront. Put more simply, we had a bedtime discussion of pandas, DNA, relatedness & poop.

As I've mentioned before, some innocent parental actions resulted in the strong imprinting of pandas on my greatest genetics project, so that our house is now filled with various likenesses of the great bicolor Chinese icons. That I can see only 3 where I am sitting now is surprising -- and partly reflects the fact it is dark outside. We have numerous books on giant pandas and the school & public libraries have supplied more, and tonight a new little book from Scholastic arrived mysteriously on TNG's pillow. He was eagerly reading it when he came to the fateful passage: "It says they're not bears!". But 'The Boy' knows better, and he knows why.

This is a recurring theme in panda books. For a long time the taxonomic placement of pandas was a matter of great dispute, with some assigning them bearhood, some placing them with raccoons, and some allotting pandas a unique clade. A related question concerned the affinity of giant pandas for red pandas and red pandas with the other carnivores. Finally, in the late 1980's the problem yielded to molecular methods, with the clear answer that pandas are bears, albeit at the root of the ursine tree.

What's surprising is how slowly this information has moved into the world of children's books. Of course, the public & school libraries often have books which predate the great resolution, so they are forgiven. Some explain that pandas are bears, but fail to give the evidence. And a few have caught up. But this Scholastic book wasn't one of them, despite having an original copyright solidly after the molecular studies AND a bunch of professors listed as advisors.

Given that TNG is so fond of pandas, and it is no secret, there are those (often adults) who will attempt to dissuade him of their bearness. So I've tried to coach him in how to go beyond simply asserting that they are bears to explaining why science classes them so. And for an eight year old, he can give a pretty good 1-2 sentence summary.

Which leads us to scat. He merges the two a bit, certainly because of the affinity of his age group for matters excretory (which, of course, his cunning father considered in introducing this topic!). A key question in panda conservation is how many are in the wild. Between their secretive habits and dense bamboo forest habitat, it is difficult to spot a panda in the wild, let alone make a census (never mind those questionnaires!). So, as with many wild animals, DNA from panda scat is a convenient way to track individuals, and with this tracking the estimate of the number of pandas has shot up -- from the really depressing (to panda fans) ~1500ish to perhaps about a thousand more -- still in grave peril as a wild species, but a thousand more pandas napping in the woods is something to cheer. Unfortunately, the items on pandas in kids' magazines & kids' sections of newspapers still often quote the older figure.

A similar sort of experiment came up as an item of controversy earlier this year. There are many things I find admirable about John McCain (which is not synonymous with saying I'm voting for him -- I haven't decided & I won't tell once I do!), but his pandering about a bear issue earlier this year wasn't one of them. In his fight against congressional earmarks (a good thing), he had singled out a study in Yellowstone National Park which was sampling DNA from grizzly scat. Amongst his assaults on this study was the question, asked loudly, of what good this would do beyond setting up a bear dating service. Now, on the one hand I think scientists should be carefully thinking why this important study is apparently being funded by earmark and not peer review. But it is truly sad when you can explain population sampling to an eight year old, but not to someone older than his father who wishes to run the country. Yes, bear counting isn't quite on the same scale as some of the other great scientific issues which are being discussed this election year. But, given that the source behavior of that study is often cited as a benchmark for veracity ("Does a bear..."), it wouldn't be a bad one to get right.

Sunday, July 27, 2008

Do we know how Velcade doesn't work?

In my recent piece on the proteasome inhibitor Argyrin A, a commenter (okay, so far THE commenter) noted something I can't argue with, that there is not a well nailed-down understanding of why proteasome inhibition is lethal to tumor cells. I probably should write up a further exploration of how they might work, but I really need to skim the literature for any new findings (so far, nothing stunning).

As Yogi Berra might say, Velcade (bortezomib) is effective in cancer except when it isn't. Indeed, in cell lines in culture the stuff is devastating, but that certainly isn't what's seen in the clinic. In that setting, there is this obnoxious tease of a signal in Phase I (remember, in oncology Phase I is tried in patients with the disease, not with healthy volunteers as in most indications) followed by cruel let-down in Phase II. Even in diseases where the drug works, such as myeloma, it doesn't work in all patients and some patients become resistant. Perhaps that resistance is the key to the puzzle: understand how tumors stop being sensitive and you'd understand the ones which are never sensitive to start with.

Three recent papers (in Blood, J Pharmacol Exp Ther & Exp Hematol) have found the same mechanism for this transition. Alas, none are free & I've only read the abstracts (one of many reasons to swing by the MIT library soon). All point to overexpression and/or mutation in PSMB5, the proteasome subunit which binds Velcade. Two of the papers report different point mutations, but both in the Velcade binding pocket and in at least one a reduced affinity for Velcade was demonstrated. Game, set & match?

Well, perhaps not. First of all, all three studies are in cell lines, two in closely related ones. As noted above, cell lines are highly imperfect for exploring proteasome inhibition in particular (and not uniformly reliable for oncotherapeutic pharmacology in general). Judging from the abstracts, none of them went fishing around in patient samples, or if they did they came up dry. Given that PSMB5 is an obvious candidate gene for bortezomib resistance, I'm pretty sure this one's been hammered on hard by my former colleagues. Nobody likes to publish the Journal of Negative Results, which I'm pretty sure is where it would end up. Almost certainly some patients will be found who went from sensitivity to resistance due to mutations in PSMB5, but at the moment it's not the long-awaited (and much desired/needed) central hypothesis of why proteasome inhibition works and which patients it should be used in.

Thursday, July 24, 2008

Another missed Nobel

The newswires carried the story of Dr. Victor McKusick's passing today. McKusick was the first to catalog human mutations (as Mendelian Inheritance in Man, now better known as OMIM in its Online version), and can be truly seen as one of the founders of genomics. I won't claim to know his full biography, but compiling lists of human mutations way back when probably seemed like a bit of an odd task to a lot of his contemporaries.

This follows the sudden passing of Judah Folkman earlier this year. Each stole from us a great light in biology, and the Nobel Committee failed to recognize either one.

Of course, there are only three Medicine awardees a year (sometimes the biologists sneak in on the Chemistry prize, but clearly McKusick & Folkman would have been in consideration for the Medicine prize). Nobel picking is a strange and unfathomable world. I'm not complaining about anyone unworthy getting it (though the Nobels have some serious closeted skeletons from the early days -- prefrontal lobotomies for all!), but it's too bad so many miss out who would deserve it.

Monday, July 21, 2008

The curious case of the proteasome inhibitor Argyrin A

A burning set of questions in my old shop when I was there, and I have every reason to think is still aflame, is why does Velcade work in some tumors but not others and how could you predict which tumors it will work in. Does the sensitivity of myelomas & certain lymphomas generally (and a seemingly random scatter of solid tumor examples) to proteasome inhibition follow a pattern? And is this pattern a reflection of the inner workings of these cells or more how the drug is distributed throughout the body?

An even broader burning question is whether any other proteasome inhibitor would behave differently at either level. Would a more potent inhibitor of the proteasome have a different spectrum of tumors which it hit?

Now, while Velcade (bortezomib, fka PS) is the only proteasome inhibitor on the market, it will probably not always be that. Indeed, since Velcade has proven the therapeutic utility of proteasome inhibition, other companies and academics have been exploring proteasome inhibitors. The most advanced that I am aware of is a natural product being developed by Nereus Pharmaceuticals, which I will freely confess to not really following.

The featured (and therefore free!) article in July's Cancer Cell describes a new proteasome inhibitor, another natural product. Argyrin A was identified in a screen for compounds which stabilize p27Kip1, an important negative regulator of the cell cycle. Kip1 is one of a host of proteins reported to be an important protein stabilized by proteasome inhibition (one of my duties back on Landsdowne Street was to catalog the literature on such candidates). While there are probably many ways to stabilize p27Kip1, what they reported on is this novel proteasome inhibitor.

By straightforward proteasome assays Argyrin A shows a very similar profile to Velcade. That is, the proteasome has multiple protease activities which can be chemically distinguished, and the pattern of inhibition by the two compounds is very similar. However, by a number of approaches they make the case that there are significant biological differences in the response to Velcade & Argyrin A.

Now there is a whole lot of data in this paper & I won't go into detail on most of it. But I will point out something a bit curious -- very curious. They performed transcriptional profiling (using Affymetrix chips) on samples treated with Velcade, Argyrin A, and siRNA vs an ensemble of proteasome subunits, each at different timepoints. In their analysis they saw lots of genes perturbed by Velcade but a very small set perturbed by Argyrin A and the siRNA. Specifically, they claim 10,500(!) "genes" (probably probesets) for Velcade vs 500 for Argyrin A. That's a huge fraction of the array moving!

Now, I'll confess things are a bit murky. Back at MLNM I would have had the right tools at my disposal & could quickly verify things; now I have to rely on my visual cortex & decaying memory. But when I browse through their lists of genes for Argyrin A in the supplementary data, I don't see a bunch of genes which are a distinct part of the proteasome inhibition signature. At MLNM, huge numbers of proteasome inhibition experiments were done & profiled on arrays, using a number of structurally unrelated proteasome inhibitors in many different cell lines. Not only does a consistent signal emerge, but when an independent group published a signature for proteasome inhibition in Drosophila there was a lot of overlap in their signature & our signature once you mapped the orthologs.

What's the explanation? Well, it could be that I'm not recognizing what is there due to poor memory, though I'm pretty sure. One thing that is worrisome is that the Argyrin A group's data is based on a single profile per drug x timepoint; there are no biological replicates. That's not uncommon due to the expense and challenge of microarray studies, but good experiments are easy. Nor was there any follow-up by another technology (e.g. RT-PCR) to show the effects across biological replicates or other cell lines. Given that these are in tissue culture cells, which can behave screwy if you stare at them the wrong way, that's very unfortunate. Even small differences in the culturing of the cells -- such as edge effects on plates or humidity differences -- can lead to huge artifacts.

Another possible explanation is that the Bortezomib cells were watched too late; the first Velcade timepoint is at 14 hours. After 14 hours, the cells are decidedly unhealthy and heading for death. The right times to sample were always a point of contention, but one suggestion that there is an issue is the lack of correlation between the different timepoints for Velcade vs the strong correlation for the other treatments (Figure 7). That works (in my head at least) in reverse too -- it's downright odd that their other treatments are so auto-correlated between 14 and 48 hours with Argyrin A -- if cells are not yet dead at 14 hours but committed to die, one would expect there to be some sort of movement away from the original profile.

One other curiosity. They do report looking for the Unfolded Protein Response (UPR) and report seeing it in the Velcade treated cells but not Argyrin A treated ones. The UPR is the cell's response to misfolded proteins -- and since disposal of misfolded proteins is a role of the proteasome, it has never surprised anyone that the UPR is induced by proteasome inhibitors. Can you really have a proteasome inhibitor that doesn't induce the UPR? If this is truly the case, it is very striking and deserves its own study.

Is the paper wrong? Obviously I can't say, but I really wonder about it. I also wonder if the referees brought up the same questions. Hopefully we'll see some more papers in the future which explore this compound in a wider range of cell lines and with more biological replicates.

Nickeleit et al
Argyrin A reveals a critical role for the tumor suppressor protein p27(kip1) in mediating antitumor activities in response to proteasome inhibition.
Cancer Cell. 2008 Jul 8;14(1):23-35.

Wednesday, July 16, 2008

Forging into the gap

Gaps are important. There is a major brand by that name. Controversy over a perceived "missile gap" was a major issue in the Nixon-Kennedy election of 1960. Budget gaps cause governments to trim services. About a half an hour's drive west of where I grew up is the town of Gap, and a bunch of generations ago my ancestors probably passed through the Cumberland Gap.

Gaps occupy a special place in computational biology, specifically in the alignment of sequences and structures. As sequences evolve, they can acquire new residues (insertions) or lose residues (deletions), and so if we wish to align a pair of sequences we must put a gap in. Pairwise algorithms such as Needleman-Wunsch-Sellers and Smith-Waterman insert the optimal gaps -- given certain assumptions which include, but are not limited to, the match, mismatch, gap insertion and gap deletion penalties. Some pairwise alignment problems have been addressed by even more complicated gapping schemes. For example, if I am aligning a cDNA to a genomic sequence I may wish to have separate consideration of introns (a special case of gaps), gaps that would insert or remove multiples of three (codons) or gaps which don't fall in either of those categories.
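For anyone who hasn't seen the pairwise case spelled out, here is a bare-bones global alignment in Python using the textbook dynamic programming recurrence. It is a sketch under simplifying assumptions: a single linear gap penalty (so insertions and deletions cost the same), and scoring values of +1, -1 and -2 that I picked purely for illustration. The more elaborate gapping schemes mentioned above are not attempted.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    The penalty values are illustrative defaults, not recommendations.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Aligning "GATTACA" against "GATACA" with these settings puts a single gap opposite one of the two Ts. Which T gets it is an arbitrary tie-break in the traceback, a small taste of why gap placement is slippery even in the pairwise case.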

Multiple sequence alignment gets even harder. There are no practical exact algorithms to compute a guaranteed best alignment for more than a handful of sequences, so all methods have some degree of heuristics to them. Many algorithms are progressive, first aligning two sequences and then aligning another to that alignment and then another and so on, or perhaps aligning pairs of sequences and then aligning the aligned pairs and so on. Placement of gaps becomes especially tricky, as their placement in early alignments greatly influences the placement in later alignments, which could well be a bad thing.

Protein alignments in particular have the problem of trying to serve three masters, who are often but not always in agreement. An alignment can be a hypothesis of which parts of a protein serve the same role, a hypothesis as to which amino acids occupy similar positions in space, or a hypothesis as to which amino acids derive from codons with a shared ancestry. Particularly in the strongly conserved core of proteins these three are likely to be in agreement, but in the hinterlands of structural loops in proteins or disordered regions it's not so clear. There is also a bit of aesthetics that comes in; alignments just look neater and simpler when there are fewer gaps. Perhaps not quite Occam's Razor in action, but simplicity is appealing.

The June 20th issue of Science (yep, Science & Nature have been piling up) has a paper that addresses this issue and builds an algorithm unapologetically aligned to just the one goal: find the most plausible evolutionary history. They point out that while insertions and deletions are treated symmetrically by pairwise programs, they are quite asymmetric for progressive multiple alignment. The alignment gets to pay once for deleting something, but insertions (like overdue credit cards) incur a penalty with each successive alignment. It seems unlikely that nature works the same way, so this is undesirable.

One solution to this has been to have site-specific insertion penalties. Löytynoja & Goldman point out that this compensation often doesn't work and causes insertions to be aligned which are not homologous, in the sense that they each arose from a different event (indeed, these insertions should not be aligned with anything from an evolutionary point-of-view, though structurally or functionally an alignment is reasonable).

As an alternative, their method flags insertions made in early alignments so that they are treated specially in later alignments. The flagging scheme even allows insertions at the same position to be treated as independent -- they neither help nor penalize the alignment and are reported as separate entities.

Using synthetic data they tested their program against a number of other popular multiple aligners and found (surprise!) it did a better job of creating the correct alignment. They also simulated what getting additional, intermediate data does for the alignments -- and scarily for the older alignment programs gap placement got worse (less reflective of the actual insertion/deletion history of the synthetic data).

The article closes with an interesting question: has our view of sequence evolution been shaped by incorrect algorithms? Is the dominant driver of sequence change in protein loops point mutants or small insertions/deletions?

Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis
Ari Löytynoja and Nick Goldman
http://www.sciencemag.org/cgi/content/abstract/320/5883/1632
p. 1632

Tuesday, July 15, 2008

If life begins at conception, when does life start & when does it end?

Yesterday's Globe carried an item that Colorado is considering adopting a measure which would define a legal human life as beginning at conception. Questions around reproductive ethics and law raise strong emotions, and I won't attempt to argue either one of them. However, law & ethics should be decided in the context of the correct scientific framework, and that is what I think is too often insufficiently explored.

Defining when life "begins" is often presented as a simple matter by those who are proponents of a "life begins at conception" definition. However, to a biologist the definition of conception is not so simple. Conception involves a series of events -- at one end of these events are two haploid cells and at the other is a mitotic division of a diploid cell. In between a number of steps occur.

The question is not mere semantics. Many observers have commented that a number of contraceptive measures, such as IUDs and the "morning after" pill would clearly be illegal under such a statute, as they work at least in part by preventing the implantation of a fertilized egg into the uterine wall. Anyone attempting to develop new female contraceptives might view the molecular events surrounding conception as opportunities for new pharmaceutical contraceptives. For example, a compound might prevent the sperm from homing with the egg, binding to the surface, entering the egg, discharging its chromosomes, locking out other sperm from binding, or prevent the pairing of the paternal chromosomes with maternal ones (there's probably more events; it's been a while since I read an overview). Which are no longer legal approaches under the Colorado proposal?

At the other end, if we define human life by a particular pairing of chromosomes and metabolic activity, then when does life end? Most current definitions are typically based on brain or heart activity -- neither of which is present in a fertilized zygote.

Again, the question is not academic. One question to resolve is when it is permissible to terminate a pregnancy which is clearly stillborn. Rarer, but even more of a challenge for such a definition, are events such as hydatidiform moles and "absorbed twins".

In a hydatidiform mole a conception results in abnormal development; the chromosome complement (karyotype) of these tissues is often grossly abnormal. Such tissues are often largely amorphous, but sometimes recognizable bits of tissue (such as hair or even teeth) can be found. Absorbed twins are the unusual, but real, phenomenon of one individual carrying a remnant of a twin within their body. Both of these conditions are rare (though according to Wikipedia in some parts of the world 1% of pregnancies are hydatidiform moles!) but can be serious medical issues for the individual carrying the mole or absorbed twin.

Are any of these questions easy to answer? No, of course not. But they need to be considered.

Wednesday, July 09, 2008

Do-it-yourself genomics: bad advice is bad advice

GenomeWeb's frequently entertaining Daily Scan notes that Wired magazine has a wiki which gives instructions on how to explore your own genome, including how to do your own genetic testing by home-PCRing your DNA and sending it to a contract lab for sequencing.

It isn't a very good idea, but that doesn't mean people won't try it. Doing a simple PCR really is pretty easy; I've done it in a hotel ballroom (proctoring a high school science fair sponsored by Invitrogen). Instructions for homebrew thermocyclers are surely out there; a number were published in the early days of PCR. But that doesn't mean getting good results is easy. Sticking to a purely technical level, are Wired's instructions very good?

I'd say no. I suppose I should even register to edit the wiki, but at the moment I'll limit myself to pointing out some of the technical issues that are ignored or glossed over (the material I quote below may well change, since it is a wiki).

The first obvious area is primer design. Wired's instructions are pretty simple:

Designing them may be the hardest step. Look up the DNA sequence flanking your genetic marker of interest in a database like dbSNP. Pick a segment that is about 20 bases long and slightly ahead of the marker. That is your forward primer. Pick another 20ish base sequence that is behind the region of DNA that you want to study. Use a web app of your choice to find its reverse complement.

Alas, this will frequently be a recipe for disaster. As for my own qualifications for making that claim I will state that (a) I regularly design PCR amplicons in my professional life and (b) I have a much greater appreciation for my ignorance about how PCR can go awry than the average biologist. Leading the list of pitfalls is designing a primer with too low a Tm -- if those 20 nucleotides are mostly A & T, it won't work well. Second would be if the two primers will anneal to each other; you'll get lots of primer-dimer and little else. Equally bad would be a primer that can prime off itself. Third would be if the primers aren't specific to your targeted region of the genome. Prime off a conserved Alu piece and you are in real trouble.
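To make those pitfalls concrete, here is a minimal sketch of the sort of sanity checks a primer design program runs. It is only an illustration under loud assumptions: the function names are my own, the Tm uses the rough Wallace rule (2 degrees per A/T, 4 per G/C) rather than the nearest-neighbor thermodynamics real tools rely on, and the dimer check only looks for perfectly complementary 3' ends. It says nothing about genome-wide specificity, which needs a search against the genome itself.

```python
# Crude primer sanity checks; a sketch, not a substitute for a real design tool.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C by the Wallace rule.

    Only a ballpark figure for short oligos; assumed here for illustration.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def three_prime_overlap(primer_a, primer_b, max_len=8):
    """Longest run (up to max_len) in which the 3' ends of two primers
    are perfectly complementary to each other.

    A long run suggests primer-dimer; pass the same primer twice to
    check whether it can prime off itself.
    """
    a, b = primer_a.upper(), primer_b.upper()
    best = 0
    for k in range(1, min(max_len, len(a), len(b)) + 1):
        if a[-k:] == reverse_complement(b[-k:]):
            best = k
    return best


if __name__ == "__main__":
    at_rich = "AAAATTTTAAAATTTTAAAA"   # 20 nt, all A/T
    gc_rich = "GCGCGCGCGCGCGCGCGCGC"   # 20 nt, all G/C
    print(wallace_tm(at_rich), wallace_tm(gc_rich))
    # Both primers below end in GAATTC, which is its own reverse complement
    print(three_prime_overlap("ACGTACGAATTC", "TTGCAGGAATTC"))
```

Even by this crude measure, a 20-mer made entirely of A and T scores half the estimated Tm of an all-G/C 20-mer, which is the point about "20 bases" not being a specification.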

The really silly part about this advice is that there are free primer design programs all over the internet, and some of the sites will perform nearly all of the checks mentioned above.

The rules for placement are much trickier than suggested. If you are going to sequence (and you might be sequencing heterozygous DNA; see below), then you really need the primers to be at least 50 nucleotides away from what you care about -- there is a front of unincorporated dye which often drops the quality any closer than this.

Even more of a concern is the sequence data itself. Wired makes it sound easy:

Once that's done, you can buy sequencing equipment and do it yourself, or send the sample off to any one of many sequencing companies and they will do it for about five dollars.

If you are sequencing uncloned PCR products, then you are sequencing a population. If you are heterozygous for a single nucleotide, that means that nucleotide will read out as a mix -- two overlapping peaks of perhaps half height. A deletion or insertion ("indel") will make the trace "double peaked" from that spot on.
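A toy model shows why the indel case is so much worse than the SNP case. The sketch below simply superimposes two allele sequences position by position and writes each mixed position as its standard IUPAC ambiguity code (R for A/G, Y for C/T, and so on). The sequences and the function name are invented for illustration, and a real trace also has peak heights, noise and basecaller quirks that this ignores.

```python
# Toy model of Sanger-sequencing a heterozygous PCR product:
# each position is the superposition of the two alleles.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def mixed_read(allele_1, allele_2):
    """Superimpose two alleles, reporting mixed positions as IUPAC codes.

    The read stops at the end of the shorter allele.
    """
    return "".join(
        IUPAC[frozenset((a, b))]
        for a, b in zip(allele_1.upper(), allele_2.upper())
    )


if __name__ == "__main__":
    reference = "ACGTTGCAAGCT"         # made-up sequence
    snp_allele = "ACGTTACAAGCT"        # G>A at the sixth base
    deletion_allele = "ACGTTCAAGCT"    # sixth base deleted

    print(mixed_read(reference, snp_allele))       # one ambiguous position
    print(mixed_read(reference, deletion_allele))  # out of phase after the indel
```

With the SNP, exactly one position becomes ambiguous. With the single-base deletion, the two alleles fall out of register and most positions downstream of it read as a mix, apart from the occasional spot where the shifted sequences happen to agree.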

Those are the best case scenarios. If you had poor quality amplification (due to badly designed primers or just a miserable to amplify region), all those truncated PCR products will be in the sequencing mix as well -- further degrading your signal. If your SNP is in a region expanded due to copy number variation, then life is even harder.

Which gets to another point: Wired seems to be ignorant of copy number variants. Their testing recipe certainly won't work there.

The idea of untrained, emotionally involved individuals trying to interpret good genetic data is scary enough (Wired's example of celiac disease, as pointed out over at DNA and You, is a particularly problematic one); scarier is to overlay lots of ambiguity and error due to sloppy amateur technique. Hopefully, few will have the energy & funds to try it.

Monday, July 07, 2008

History Forget: How not to explain the impact of Prozac

Having escaped the usual abode for the weekend, there was a pile of accumulated newspapers to digest on the train this morning. The Sunday Globe Ideas section caught my eye with an item by Jonah Lehrer titled "Head Fake: How Prozac sent the science of depression in the wrong direction". It's not an awful article -- once you get past that subtitle. But, it isn't a great article either.

The article puts forth the thesis that Prozac led to a chemical theory of depression, which recent literature has seriously upended. Alas, that greatly distorts the history.

Prozac was not the first successful drug nor the real antecedent to a chemical theory of depression. Early antidepressants such as the tricyclics and monoamine oxidase inhibitors opened the path to thinking that depression was due to imbalances in specific neurotransmitters. Prozac itself, as a Selective Serotonin Reuptake Inhibitor (SSRI), was an outgrowth of that work -- given the previous success with psychoactive drugs which seemed to affect many neurotransmitters and evidence that specific neurotransmitters might be more important for specific psychological diseases, it was natural to try to zoom in on one neurotransmitter. Prozac then is not a paradigm shifter (à la Kuhn) but was an extension of the existing paradigm. The success of SSRIs, partly due to a significantly attenuated side effect profile and partly due to a lot of popular press and partly due to marketing, merely pushed an existing theory up the ranks, particularly in the popular zeitgeist.

Lehrer does do a nice job of summarizing some recent work suggesting how antidepressants may really work, which is that they may help neurons heal (a new paradigm of depression as a neurodegenerative disease). In a recent conversation a clinician acquaintance noted to me some of the same key points (I'll confess to having not read the literature myself), so there's nothing wrong here. He also notes that it was the investigation of inconsistencies of observation with the predictions of the chemical imbalance theory, such as the frequently observed time lag between beginning antidepressant therapy and seeing results, which led to the new theory.

But getting back to that irksome subtitle, did Prozac steer "the science of depression in the wrong direction" or simply on a winding path? Yes, the chemical imbalance theory looks like it may be down for the count. However, it was that very same theory, via its shortcomings, that led to the new theory. This is how science works -- it's often indirect & messy. That's an important message that's lost (or nearly so) in the piece. SSRIs were perhaps a blunt tool, but they are the tool which has unlocked a new understanding of the topic.

Could we have gotten to the current understanding of depression without SSRIs and other chemical antidepressants? That's an exercise in alternative history best left to experts in the field, if anyone. Perhaps we might have, but perhaps not -- or would have via an even more tortuous path. It is important to get out the story of how pharmaceutical antidepressants do and do not work, but it is equally important to get out the story of how science really works.

Thursday, July 03, 2008

Myeloma unified?

Multiple myeloma is a complex disease. Perhaps one metaphor is that of the mythical Hydra -- each time a new molecular tool is thrown at it the number of vicious heads increases. For example, there are different chromosomal translocations which lead to myeloma. If you look at myeloma samples by transcriptional profiling, then one can find distinct expression signatures for each translocation -- and just as easily find ways to split those signatures into further subtypes. For example, some translocations activate one gene disrupted by the translocation whereas other instances of the same translocation will activate both deranged genes.

Another possible metaphor is the old fable of blind men examining an elephant -- each reports that the object is different, based on examining a different portion of the beast. In the case of myeloma, one examiner might focus on the subset with large portions of the genome amplified, others on specific deletions on chromosome 13, another on those cases where bone destruction is rampant. My own experience with palpating the pachyderm looked at the response to a specific drug.

Now the Staudt lab has come out with a paper in Nature which proposes lumping everything back together again. Initially using a retroviral RNAi screen they identified the transcription factor IRF4 as a unifying theme of myeloma. IRF4 is activated in one characteristic translocation and plays an important role in B-cell development, so it's not a total shock. But linking it across multiple types is surprising.

The screen achieved 2-8 fold knockdown of IRF4 in 3 different myeloma cell lines, each possessing a different hallmark translocation (one of which was an IRF4 translocation). This was later extended to additional myeloma lines with similar lethality, but the knockdown of IRF4 in lymphoma lines had little effect, save one line possessing a translocation of IRF4.

One interesting surprise is that with the exception of the known IRF4 translocation bearing line, none of the lines have amplifications or other obvious derangements of IRF4. Only one showed point mutations upon resequencing. Hence, somehow IRF4 is being activated but not via a painfully obvious mechanism.

RNAi approaches can suffer from off-targets, genes not meant to be hit which cause the phenotype being studied rather than the believed target. The paper provides strong evidence that the effects really are driven by IRF4 knockdown -- not only were multiple shRNAs targeting IRF4 found to kill myeloma cells, but one of these targets the 3' untranslated region of IRF4 -- and the phenotype could be rescued by expressing IRF4 lacking the 3' UTR.

Transcriptional profiling of the knockdown lines in comparison with parental lines revealed a number of candidate IRF4 targets, and a large number of these were also identified by chromatin immunoprecipitation-chip (ChIP-chip) studies, confirming them as direct IRF4 targets. As noted, some direct targets may have been missed by ChIP-chip due to limitations with the arrays used. One other interesting aspect: the IRF4 target list in myeloma lines somewhat resembles a union of that in plasma cells (the normal cell myelomas are most kin to) with that of antigen-stimulated B-cells.

A particularly interesting direct IRF4 target identified in this study is the notorious oncogene MYC. A number of identified IRF4 targets are also known MYC targets, suggesting synergistic activation. They also found that both IRF4 and MYC bind upstream of IRF4 -- suggesting a complex web of positive feedback loops.

An interesting further bit of work targeted various identified IRF4 targets and showed these knockdowns to be lethal to myeloma cell lines. Hence it is suggested that IRF4 ablation in myeloma would lead to tumor cell death by many routes. Mice heterozygous for IRF4 deletion are viable, suggesting that IRF4 could be targeted safely.

The catch would be targeting IRF4 -- transcription factors are on nobody's list of favorite targets. The authors cite as points of optimism approaches targeting p53 & BCL6. However, the p53 targeting route is by inhibiting an enzyme which destabilizes p53, so an analogous approach to IRF4 would require first identifying key determinants of its stability. The BCL6 example they cite uses a peptide mimic, not something the medicinal chemists love much.

Other approaches to targeting IRF4 might focus on "druggable" (if any) genes in the IRF4 target lists, or perhaps something else. I'll try to put together a post next week on one of those candidate elses.

Now that Staudt's group has brought things together, it is tempting to contemplate slicing off some more Hydra heads. How do IRF4 target gene profiles differ across the chromosomal aberration subtypes of myeloma? Do IRF4 targets have any predictive value for determining the appropriate medication or show differential response to different medications?

Monday, June 30, 2008

Laying the groundwork for the one ton tomato

Somewhere in life I've heard a children's/novelty song about a one ton tomato; eventually (if I remember correctly) it ends up as a similar quantity of ketchup.

Nearly half-ton pumpkins show up pretty regularly at the big agricultural fairs every fall, but tomatoes aren't in that league. But, the difference between an ancestral tomato (small berries) and a multi-pound beefsteak is nothing to sneeze at. Domestication has made great strides.

A paper last month in Nature Genetics laid out part of this process. Interestingly, there are two different developmental processes that have been utilized to enlarge tomatoes. A tomato fruit is composed of multiple subunits, the carpels. One change has increased the number of cells per carpel by tinkering with the cell cycle -- a much more delicious change than what a similar process will yield in a person. The new work details the genetic change which increased the number of carpels.

Of course, of interest is how universal these mechanisms are. Most domestic fruits are greatly enlarged over their wild counterparts -- though perhaps raspberries show very little enlargement & for blueberries it is a small multiple. On the other end are those monster cucurbits at the fair and their watermelon cousins.

But getting back to the title. Now the question is whether these mechanisms have reached their biological maximum or simply what a few mutations can do (there are also practical considerations, such as the stem strength required to support larger tomatoes). Or, can we use this new knowledge to bring up the laggards -- or figure out why there are no fist-sized raspberries or basketball-like blueberries? A strawberry the size of my dog? Of course, purely economic forces might lead to the fruits commanding the most money per unit weight -- perhaps pomegranates will have an order of magnitude more seeds! Healthy for you -- so long as you watch where you eat them.

Friday, June 20, 2008

Don't do it Josh!

The Globe this week had a number of articles on the passing of the $1B biotech bill in Massachusetts and the proxy fight for Biogen Idec. But a third item really raised my eyebrows.

Vertex's CEO Joshua Boger announced that Vertex is contemplating moving out of the state. The apparent driver of this is a concern that Vertex might outgrow the Boston area and that now might be the time to move, before the company grows even larger. Previous discussion of moving had produced a striking plan to relocate to the Boston waterfront.

Now, I'll confess a certain personal interest. I'm probably going to be in this area for most of my employment life, so I don't want to see employers leave (I can see Vertex headquarters from my office). Furthermore, I believe big companies like Vertex, BiogenIdec and such have a beneficial effect on their overall corporate neighborhood -- they tend to grow more talent than they need and those persons tend to start new ventures near the old ones.

Which is the point -- people don't really like to move. Yes, some folks will follow their job to the ends of the earth, but a lot of folks won't. So atop the disruption & distraction of moving, a lot of good people will leave in a short timespan. My general prejudice is that planners recognize such costs but then grossly underestimate them.

Why might Vertex be contemplating such a move? The most cynical explanation is to try to extract tax incentives from either Massachusetts or wherever they move to. Such incentives have driven previous moves or new sites, with mixed success. Rhode Island trumpeted extracting Alpha-Beta from Massachusetts, until Alpha-Beta failed in the clinic and disappeared into the dust.

More practically Boston does have its drawbacks & tradeoffs. Traffic is awful; but that's true of a lot of America. Housing prices are insane. Neither of these encourages new workers. On the other hand, the academic & hospital environment is huge and Boston has a decent transit system, which somewhat offsets the traffic issue. It is striking that so many large biotech & pharma have been trying to move in to Cambridge/Boston over the last decade or so (Merck, Novartis, Schering, Astra, Amgen, Sanofi-Aventis, etc).

But in any case, I return to my main argument. I'm sure Vertex could thrive in many places -- Boston is not Mecca, and if they moved they would recover and thrive again -- but after paying a steep price of disruption & lost talent.

Are there other options? One of course is to stick it out in Boston. Another is to have multiple locations, which incurs its own inefficiencies. No solution is perfect. But please leave migrations for the birds!

Sunday, June 08, 2008

Visiting a time capsule

The Next Generation & I went to the Boston Museum of Science today (we're members this year) and one of the exhibits where he lingered was the one of biotechnology.

I was a bit surprised to find that it dated to 1993; I didn't remember it always being in the spot it's in, so either my memory is flaky (not an unreasonable idea) or it was moved or in storage at some time. But it has been out for a while.

Simply looking at the list of sponsors is a bit of a memory jogger. While some are unchanged (BASF, Genencor), some simply went bust (Alpha-Beta), some were absorbed in corporate actions (Genetics Institute, Perseptive Biosystems) while others remain but under somewhat different names (lawyers Hale & Dorr have several more '&' in the name now; Biogen is now Biogen Idec).

Reading the text is interesting too. For example, we can learn that the human genome maybe, possibly might be sequenced one day.

One of the displays proposes that the dye indigo might one day be synthesized by bacteria (which had been demonstrated) instead of synthesized from petroleum (which had supplanted the original natural source about a century ago); that process has apparently not (yet?) become commercially feasible.

One of the games involves performing gene therapy for cystic fibrosis using a cold virus. That's certainly still a dream, but not for lack of trying.

Another game has you adding an antifreeze gene to tomatoes to prevent their freezing; this was once an active pursuit, but I haven't heard anything lately. Certainly the no-soften tomato was a commercial flop; I'm still eagerly awaiting some tomasil seeds.

This isn't meant to ridicule the display; in general I think it was well done & carefully thought out (Aspirin has been misspelled on the display all these years, but oh well!). Making interesting, interactive exhibits on molecular biology themes remains challenging.

Perhaps what has aged the least on the displays was the addressing of ethical concerns -- when does gene therapy go too far, what privacy rights do we have to our genes, etc.

Saturday, June 07, 2008

Isn't The Great Filter something in the Whatman catalog?

Twice in the last week the Globe has run pieces on a concept called 'The Great Filter', once on the Op-Ed page and now in the Star Watch astronomy column. I've read both, and the pseudo-statistical thinking in them just irks me.

The headline on the star watch column suggests the hubris that is perhaps what is goading me: "Why a microbe on Mars would change humanity's future". I'd completely agree that discovering microbial life on Mars would be exciting, but where it goes from there is bizarre.

The gist of the argument can be found in this quote:

If life arose independently twice in just one solar system, it would mean that the life formation process is easy and common. Life would be abundant everywhere. Most stars have planets, so the entire universe would be teeming with living things.
Good news? No. The chance for humanity's long-term survival would immediately look worse.
Follow carefully now. Whether or not simple life is common, we know that intelligent, technological life -- like us -- is probably rare. Otherwise, goes the argument, it would have noticed such a good planet as Earth and come here to colonize as early as hundreds of millions of years ago.
Given that we haven't yet found signs of other advanced life (or any life) elsewhere

If life is common, something apparently stops it from developing to the point of gaining interstellar travel and settling the galaxy...Apparently, some kind of "Great Filter" prevents life from evolving to the point of getting starships. If the Great Filter lies early in evolution -- such as if the origin of life itself is a rare fluke -- then we, humanity, have already gotten through it. If the Great Filter lies ahead of us -- such as, for instance, if technological civilizations always destroy themselves as soon as they get to power -- then we have no more chance of making it than all the others who have failed and left the cosmos silent.
The more advanced the fossils of living things that Mars may hold, the greater the chance that the Great Filter lies not behind us but ahead.

Okay, just where to start. First, the current Mars mission finding life on Mars is a far cry from finding that life arose independently on Mars. We know that rocks make the transit occasionally, and while we think we sterilized all the probes, the possibility that any life form found really shares a common heritage must first be ruled out. Gary Ruvkun has suggested an experiment for a future probe to look for & sequence ribosomal RNA (if I remember correctly); that would be an appropriate follow-up.

There's also the problem of an N of one: Mars is one planet. Maybe you count an N of 2 with Earth as the second case, though since you're trying to predict on it that's a case of training on your test set. Mars is hardly an independent sample; it sits in the same solar system, which may or may not have some unusual properties.

But perhaps more irksome is conflating the reasonable idea that there are difficult barriers to the rise of spacefaring species with the rather silly one that there is a single "Great Filter". Mars is a particularly poor example, as we would have a good guess what the filter is there: the planet quit being a nice place to live.

How improbable is life? How often do planets get life but it stays unicellular? How often multicellular but never ambulatory, sentient beings? How often do those sentient beings come up with some way to prevent travel to the stars -- a religion that forbids it, self-extermination (which our species has toyed with)? Perhaps some inhabited planets have a super Van Allen belt which dissuaded their residents from becoming star travelers. Perhaps there are intelligent cultures far away -- but with a timing such that their signals can't yet reach us.

The fact is, any estimates of the probability of any one of these (or anything else you can imagine) are nothing but personal priors, wild guesses without much basis in fact. Feel free to make them, but spare us the headlines about predicting doom and gloom.

Thursday, June 05, 2008

Cuddle up to a phage!

While searching Amazon for a book, I came across a very funny (in a geeky way) line of plush toys: all sorts of microbes! GiantMicrobes.com has quite a taxonomy of them. I think my visual favorite is the T4 phage, but there's lots of other fun stuff here.

You can get a whole range of common (E.coli) and nasty (a whole line of venereal disease agents) microbes. Human pathogens don't have a monopoly: to terrorize Miss Amanda (or make voodoo chew toys) there's mange, rabies & heartworm.

The E.coli are a flagellated strain. You can buy one or a trio (Petri dish). Surprisingly, there isn't a package deal on T4+E.coli, nor do they (yet?) have a pBR322 to accessorize your E.coli. Perhaps a future product line extension will include GFP-expressing glow-in-the-dark variants, or perhaps some scent-enhanced ones.

Monday, June 02, 2008

House ATG.GAC.

I don't watch a lot of network television, but there are a handful of programs that have latched onto me. At the end of this season, there were just two and by accident rather than design (or perhaps it is the current plethora of such) they are both hospital-based. Last week I viewed the last of the new episodes off my PVR – so in place of a new episode this week, I'll try to sketch out my own.

House M.D. is an hourlong drama focusing on Dr. Gregory House, a brilliant diagnostician who is also an extremely difficult human being. He terrorizes his three junior colleagues, who are trapped in his orbit like the inner moons of Jupiter -- and subject to similar violent (though only psychologically) tidal forces. Three previous assistants have attained somewhat more distant orbits, though one has spiraled back in. His boss & a colleague attempt to be friends, but get much grief for their efforts.

As with most series TV, there is a basic formula, a framework which the writers decorate or modify each week, rarely breaking it entirely. The scheme here generally starts with a patient arriving with some strange, dramatic set of symptoms (usually exposited prior to the opening credits). House is either intrigued or blackmailed by his boss into taking the case. Lots of diagnostic dead ends follow (and new symptoms appear), accompanied by exorbitant amounts of testing. House's assistants provide the union of all high tech medicine & are capable of running any diagnostic under the sun (somehow, the hospital lacks lab techs!). By the end, the case is solved -- and more often than not the patient survives (a few lose the lottery).

One thing you actually DON'T see much of is DNA testing -- once in a while, but it hardly shows up as much as on a CSI/Law & Order type police procedural. DNA testing just doesn't televise well; the best you can do is show someone drawing their own blood (what, no buccal swabs?). In contrast, the MRI room has lots of fun angles -- private conversations behind the console, bouts of claustrophobia, or dramatic races to reach the suddenly stricken patient. Sequencers just aren't very dramatic.

So, I'm going to suggest an episode. Perhaps this qualifies as a "treatment" in Hollywood-speak. I have no desire for a career there, but if the writers take the idea I'd hardly turn down a walk-on.

A patient arrives at Princeton-Plainsboro seeking House due to a mysterious set of symptoms which has afflicted her for years. As usual with such, House is disdainful -- until the patient tries to hand him a DVD but dramatically collapses instead with some interesting symptom along the way. When the patient regains consciousness in a hospital bed, she starts asking about the DVD again -- and then delivers the trump card: the DVD has her genome sequence on it.

House has no great interest in the DVD, and argues how useless it is. He's patently annoyed by it. One of the assistants makes the mistake of rising to the bait and proposing that perhaps a critical clue lies within -- and thereby gets assigned the task of cross-referencing EVERY polymorphism against the patient's symptoms. Several dead ends come from the DNA data, but nothing useful -- or in reality, just too many hypotheses which are too tenuous to do anything with. That doesn't stop the young assistants from batting some around and debating the now and future utility of such scans.

Now, as an aside, the story really (in my opinion) needs a complete genome scan. However, if there is a desire to garner some product placement that would narrow the candidates to one (Knome) at this stage. SNP scans aren't quite as dramatic!

At the end, the patient's puzzle is solved & they get to proceed in life knowing what they have & able to manage it. But, the kicker is that the assistant now cross-references the now known disease against the polymorphisms and comes up with an answer -- but it was buried deep within hundreds of other equally supported hypotheses. Finish the episode with some more back-and-forth amongst the characters about how this might play out the next time. How their careers might change. How well (or not so well) their training has prepared them for this.