Wednesday, October 31, 2007

Scientific Easter Eggs

Blogging on Peer-Reviewed Research
Tonight, of course, is Halloween, one of the many holidays which in the U.S. has a serious sweet tooth. After taking Version 2.0 around for the tradition of gentle extortion on this day, I indulge in my own rituals -- listening to Saint-Saens & reading The Raven. It isn't exactly the right time, but the confectionery angle got me thinking about other sweet holidays, and then to Easter Eggs -- of the scientific kind.

There was a recent complaint in Nature about the growing shift of information from the printed versions of articles to the Supplementary Online Material (SOM). I can definitely sympathize -- as the writer complained, key details have been migrating to the SOM, meaning that sometimes you can't read the print version and really tackle it scientifically. In particular, Materials & Methods sections of many papers have been eviscerated, with the key entrails showing up in the SOM. Most of the point of my print subscriptions to Science & Nature is to be able to read them during my Internet-free commute. Worse, the SOM becomes an appendage in danger of being lost or misdirected -- such as in a recent manuscript I reviewed which showed up without the supplements.

For better or worse, editors & authors have a shared interest in shifting things from print to the SOM. For editors, online is cheap. For authors, it is a way to cram more into fixed paper size limits. Clearly some material (such as videos) can only go into SOMs, and lots of supporting data really does belong there.

In computer code, an Easter Egg is a hidden surprise -- if you know the right combination of keystrokes or commands or such, something interesting (and generally irrelevant to the program) will show up. I'm not sure I've actually ever seen one -- I'm generally too impatient to deal with such things, but I do recognize they exist. Granted, perhaps some of that programming effort would be better spent wringing a few more bugs out, but it is a way for coders to blow off steam.

I propose that a scientific Easter Egg is the inclusion in Supplementary Online Material of valuable scientific data which is peripheral to the main thrust of the paper, but is nevertheless a significant advance. Such events are probably rare, as it requires a certain mindset to bury a possible Minimal-Publishable-Unit in another paper's SOM, but on the other hand it beats something never being published -- and perhaps it is interesting to some but viewed as too minor to merit a paper.

I'll give you an example, from the Church lab. George has long been burying stuff in papers -- for example, one of the footnotes to the original multiplex sequencing paper declared that the technology was being used to shotgun sequence Salmonella typhi AND Escherichia coli! Alas, the project was ahead of the technology & never completed. But a much better Easter Egg is in the first large-scale polony sequencing paper (PDF; SOM). Supplementary Figure 2 is really an in-depth study of the site preferences of the Type IIS restriction enzyme MmeI -- driven by about 20K sequencing examples. This is really a bit of restriction enzymology hiding in a sequencing paper. Because the enzyme is used in the method, it is relevant -- but not quite critical. The enzyme preferences are important because they could create biases in sequence sampling, but they are hardly the main point of the paper -- which is why they are in the SOM.

I'm sure there are even better examples out there. What is the most interesting tangential information you have seen in an SOM?

Shendure, J, Porreca, GJ, Reppas, NB, Lin, X, McCutcheon, JP, Rosenbaum, AM, Wang, MD, Zhang, K, Mitra, RD, Church, GM (2005) Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome. Science 309(5741):1728-32. DOI: 10.1126/science.1117389

Church, GM, Kieffer-Higgins, S (1988) Multiplex DNA sequencing. Science 240(4849):185-8. DOI: 10.1126/science.3353714

Monday, October 29, 2007

One lap around

Officially, I started this blog a year ago yesterday, with the first post about science coming the next day.

At times, I wonder what possessed me to assign myself a regular writing assignment. But, it's definitely been rewarding. I've learned from comments & emails, made a lot of new connections, and maintained an incentive to read a lot of papers that aren't directly connected to my current professional duties. I've also gotten to indulge my fondness for wordplay and pun-filled headlines.

I thought I knew what I'd write about, and in general I've stuck to it, though I've certainly strayed periodically a bit outside of biotech & bioinformatics (or figured out tenuous links to them). I've also covered some topics more than I ever would have guessed: I had no idea when I started I'd write so much about dogs!

What might I change for the next year? I really should be more active in blog carnivals -- I miss the deadlines far more than I hit them, and have probably shown up in as many from editors being kind as those I've submitted. I really should take some turns at editing a carnival edition. I also plan to try to join the Blogging on Peer-Reviewed Research bandwagon, though with diligence to only claim that icon when I have actually fully read the article (which dings all the papers for which I have access to only the abstracts!).

Keeping posts on a regular schedule has been challenging, which makes me appreciate all the more folks like Derek Lowe who post intelligent writing like clockwork. Sometimes there are good excuses (internet-free vacations), but too often the writing gets put off until too late at night. I've also noticed a tendency to follow up flurries of writing with droughts -- the week-long marathon in February, for example, followed by a weak March. Need to work on that. 182 posts over a year's time -- averaging out to one every other day. I thought I'd be higher than that, but maybe that really is the comfortable hobby level.

While the writing is a solo effort, getting readership has been helped by an army of others. GenomeWeb is nice enough to feature me regularly in their blog, and a number of individual bloggers have helped through blogrolls, carnival invites & cross-links. The DNA Network is now my primary blog read each day (plus Dr. Lowe).

I'm also surprised at the number of article ideas I've let sit on the shelf -- some dating back to near the beginning. Either post them or kill them!

Thanks to all for reading this. I hope I'll continue to earn your eyes for another year.

Saturday, October 27, 2007

Genomics Lemonade

One of the many attractions of the next generation sequencing techniques is that they eliminate the step of cloning in E.coli the DNA to be sequenced. Not only does this step add complexity and expense, but it also detracts from results. Shotgun sequencing attempts to reconstruct the genome from a large random sample of fragments, but there are some pieces of DNA which clone poorly or not at all in E.coli, skewing the sample. These regions have often required labor intensive, expensive targeted efforts to finish.

However, when life gives out lemons, some break out the sugar and glasses. A new paper in Science Express (subscription required) turns this phenomenon around in a clever way. All those failed clonings weren't nuisances, but experiments -- into what can be cloned into E.coli. And since horizontal transfer of genes is rampant in bacteria, it's an important phenomenon with relevance to medicine (virulence genes are often transferred). And on a huge scale: 246K genes from 79 species, using 1.8 million clones covering 8.9 billion nucleotides.

The first filter was to identify short genes which rarely showed up in toto in plasmid clones, looking at short (<1.5Kb) genes since longer ones will rarely be complete in a short insert clone. Now, common plasmid vectors replicate at multiple copies per cell. To further refine the list, the authors also looked for evidence that these genes were underrepresented in long-insert clones, which typically are in vectors which replicate at a few copies per cell.
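The two-stage filter described above can be sketched in a few lines. This is only an illustration: the `CloneCounts` record, the gene names, the counts and the `DEPLETION_RATIO` cutoff are all hypothetical, and the paper's actual statistics are not reproduced here. Only the 1.5Kb length limit comes from the text.

```python
from dataclasses import dataclass

MAX_GENE_LENGTH_BP = 1500  # from the text: only short (<1.5Kb) genes are considered
DEPLETION_RATIO = 0.2      # hypothetical cutoff: observed/expected below this is "depleted"


@dataclass
class CloneCounts:
    """Hypothetical per-gene summary of how often the whole gene was cloned."""
    gene: str
    length_bp: int
    plasmid_observed: int        # multi-copy plasmid clones containing the full gene
    plasmid_expected: float      # number expected if cloning were unbiased
    long_insert_observed: int    # same, for low-copy long-insert clones
    long_insert_expected: float


def depleted(observed, expected, ratio=DEPLETION_RATIO):
    """True if far fewer complete clones were seen than expected."""
    return expected > 0 and observed / expected < ratio


def passes_first_filter(c):
    """Short gene that rarely shows up whole in multi-copy plasmid clones."""
    return c.length_bp < MAX_GENE_LENGTH_BP and depleted(
        c.plasmid_observed, c.plasmid_expected
    )


def passes_both_filters(c):
    """Also underrepresented in low-copy, long-insert clones."""
    return passes_first_filter(c) and depleted(
        c.long_insert_observed, c.long_insert_expected
    )


# Made-up examples, one per outcome.
genes = [
    CloneCounts("rpX", 600, 1, 40.0, 0, 12.0),     # depleted in both library types
    CloneCounts("dosY", 900, 2, 35.0, 9, 10.0),    # depleted only at high copy
    CloneCounts("ctlZ", 800, 38, 40.0, 11, 12.0),  # clones normally
    CloneCounts("bigW", 4200, 0, 5.0, 3, 12.0),    # too long to assess
]

first_pass = [c.gene for c in genes if passes_first_filter(c)]
refined = [c.gene for c in genes if passes_both_filters(c)]
print("first filter:", first_pass)
print("both filters:", refined)
```

A gene like the made-up `dosY`, which fails only at high copy number, is the sort of case the second filter is meant to set aside.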

No one gene was poison from every species, but the 'same' gene from closely related species was often trouble. Species related to E.coli often had more toxic genes, perhaps because these species already had promoters which could drive significant expression in E.coli. So, they took examples of two such genes (both for ribosomal proteins) from 31 species, placed them under the control of an inducible promoter, and showed much greater toxicity when the promoter was turned on. 15 randomly chosen control genes did not show toxicity.

What kind of genes transfer poorly? One major class is proteins involved in the ribosome, a class previously noted to be rarely found amongst genes thought to have been horizontally transferred. One possible explanation for this is that the ribosome is a highly tuned machine, with excess components able to fit in but not fully function. Interestingly, the proteins in direct contact with ribosomal RNA were found to be more likely to be in the toxic set.

Another test was to simply look at what E.coli genes can't be transferred into E.coli -- well, transferred from single copy in a wild-like strain to multi-copy in a lab strain. Such genes are probably toxic purely due to dosage effects (such screens have been used to great effect in the past, e.g. this).

What's missing from the paper? Two quick questions came to my mind. First, how many of the genes are essential in E.coli? Second, what if you simultaneously knocked out the endogenous copy and expressed the foreign one -- would that lessen the toxicity?

There are other examples of leveraging trouble into something interesting that I have had some connection to.

During the early 1990's, no sequencing was going fast enough for young, impatient folk, especially E.coli. At one Hilton Head Conference, there was loose talk of a 'schmutz' genome project -- we would go through all the unalignable reads from all the genome sequencing centers, figuring that a significant fraction were E.coli contamination & therefore might help fill in the E.coli genome. Alas, we never actually pushed forward.

When I was at Millennium in the late 1990's, we were mining a lot of EST data from our own libraries, from the public collections, and from the in-licensed Incyte databases. A constant minor nuisance was the presence of different contaminants in these collections, and at one point I had my group trying to clean this up. We could successfully identify a number of contaminants, which were sometimes very center-specific. For example, the Brazilian EST collections had contamination from the citrus (lemon?) pathogenic bacteria they were sequencing at the same time. I regarded this solely as a cleanup operation, and when we were done we were done -- but of course some people think more cleverly & so I was chagrined to see a paper by George Church and company using this technique to associate bacteria and viruses with human disease.

All that writing has made me thirsty. Lemonade anyone?

Thursday, October 25, 2007

At long last, a 2nd GPCR crystal structure

G-protein coupled receptors, or GPCRs, are a key class of eukaryotic membrane receptors. Roughly 50% of all small molecule therapeutics target GPCRs. Vision, smell & some of taste use GPCRs. Ligands for GPCRs cover a wide swath of organic chemical space, including proteins, peptides, sugars, lipids and more.

Crystal structures are spectacular central organizing models for just about everything you can determine about a protein. Mutants, homologs, interactors, ligands -- if you have a structure to hang them on, understanding them becomes much easier. For drug development a 3D structure can be powerful advice for chemistry efforts, suggesting directions to build out a molecule or to avoid changing.

Because they are large, membrane-bound proteins with lots of floppy loops, GPCRs are particularly challenging structure targets. Efforts to build homology models relied on bacteriorhodopsin, which is not a GPCR but has the seven transmembrane topology of GPCRs. The first GPCR structure was finally published in 2000, of bovine rhodopsin. Cow rhodopsin has a significant advantage in that large quantities can be purified from an inexpensive natural source, cow eyes.

Since then, published crystallography of GPCRs has been restricted to further studies on rhodopsin (e.g. this mutant study). Rumors of further structures at private groups would periodically surface, but given the lack of publications & the high PR value of a publication, it seems likely these were just rumors. Now, after 7 years, the drought has been ended with a flurry of papers around the structure of a beta adrenergic receptor, the target of beta blockers.

The papers share a number of co-authors but describe two different approaches to solving the GPCR crystallization problem. For the beta-2-adrenergic receptor, a key problem is a floppy intracellular loop. In the pair (here & here) of papers online at Science, the troublesome 3rd intracellular loop is largely replaced with T4 lysozyme, a protein which has been crystallized ad infinitum. In the Nature paper & a Nature Methods paper describing the method, the intracellular loop is stabilized with an antibody raised against it.

The abstracts hint that B2AR and rhodopsin are strikingly different in some important ways, underlining the need for multiple crystal structures for a family -- with only one, it is impossible to determine what is general and what is idiosyncratic. Indeed, one of the papers reports that published homology models of B2AR were more similar to rhodopsin than to the new B2AR structure.

Will these new approaches herald a flurry of GPCR structures? Perhaps, but they hint at what a hard slog it may be. A host of additional challenges were faced, such as the crystals being so transparent it was hard to position them in the beam. Will each GPCR present its own challenges? Only time will tell.

Sanguine Thoughts

Sometimes in life, you just want to lie back and stare at the ceiling. Other times, you have no choice, which is how I found myself for a while last Sunday morning. I was lying on a simple bed, staring at the ceiling of a high school gymnasium, with tubing coming out of my right arm.

I hate needles. One of the many reasons med school was out for me is that I hate needles. I can eat breakfast while watching a pathology lecture, but I can't stand the sight of a needle going into human skin (nor a scalpel). My fear of needles was so severe I had to be partially sedated once for a blood draw, which was most unfortunate as I then couldn't scream properly when the nurse speared some nerve or another & nobody realized the agony I was in. Sticking myself as an undergraduate didn't help, though at least the needle was fresh and had not yet gone into the mouse.

So, a number of years ago I resolved to fight this irrational fear by confronting it in a positive manner, and so I started to give blood regularly. For a while I was giving pretty much as often as was allowable, but in the last few years I've slipped and missed a lot of appointments. But, the Red Cross still calls & I still get in a few times a year. And the needle phobia has been calmed from abject terror to tense dread, a marked improvement. Plus, I feel like I'm doing some good -- your odds of saving someone's life are certainly better for donating than for entering a career in drug discovery (though the latter has some huge tails -- a lucky few get to make an amazing impact).

Most blood drives are held in conference rooms, and so the ceilings aren't terribly interesting. Gym ceilings don't do much for me either. There is one memorable blood drive location I've been to: the Great Hall of the Massachusetts State House. But generally, it's iPod and random thoughts time.

This time, the iPod was giving me the right stuff (as in the soundtrack for the same), but my thoughts were roaming. Having done this a lot, one compares the sensations to previous times. For example, the needle going in had a little more burn than usual; perhaps some iodine was riding in? On the way out was even more disconcerting: a warm dripping on my arm! The needle-tubing junction had just failed, but things were rectified quickly (though I looked like an extra from M*A*S*H while I held my arm up).

But most of all, I remembered why I had come to this particular drive, with no thought of letting the appointment slip. This drive was in honor of five local children with Primary Immunodeficiency, and three of them are from a family we know well. Their bodies make insufficient immunoglobulins, leaving the patients vulnerable to various infections. Regular (sometimes as often as weekly) infusions of immunoglobulins are the treatment for this. Some causes of PI are known, but others have yet to be identified.

Given that this family has two unaffected parents and three boys all affected, my mind wanders in some obvious directions. That pattern is most likely due to the mutation lying on the X-chromosome, which sons inherit only from their mother. Given the new advances in targeted sequencing, for a modest amount of money one could go hunting for the mutation on the X -- perhaps a few thousand dollars per patient. Such costs are certainly within the realm of rather modest charity fund-raising, so will we see raffles-for-genomes in the future?

If such efforts are launched, will patients and their families be tempted to go largely on their own, bypassing conventional researchers -- and perhaps conventional ethical review boards? If anyone with a credit card can request targeted sequencing, surely there will be motivated individuals who would do so. Some, like the parent profiled in last week's Nature, will have backgrounds in genetics -- but others probably won't. Let's face it, with a little guidance or a lot of patient reading, the knowledge can be acquired by someone willing to learn the lingo.

As I was on my way out, the middle boy, who is 5, was heading outside with his mother to play. He looked up at me and said sincerely "Thank you for giving me your blood". Wow, did that feel good!

Friday, October 19, 2007

What do Harry Potter, Sherlock Holmes, Martha's Vineyard & Science Magazine have in common?

As the Harry Potter series went on, more and more of the characters' names telegraphed a key component of their properties. One of the most blatant of these is Sirius Black, (spoiler alert), who turns out to be capable of transforming into a black dog (Sirius being the dog star). Black dogs show up elsewhere in literature: the hound of the Baskervilles is reported to be a huge black hound. On the Vineyard, there is a restaurant/bar whose apparel has spread around the globe with its black Labrador logo, The Black Dog. Now, joining the parade is Science (currently available in full in the Science Express prepublication section to subscribers only), with the identification of the gene responsible for black coat color, a locus previously known as K.

The new gene turns out to be a beta defensin, a member of a family known previously for its role in immunity. Dogs are unusual in having black driven by a gene other than Mc1R and agouti. Mc1R is a G-protein coupled receptor (GPCR) and agouti encodes a ligand. Strikingly, beta defensins turn out to be ligands for Mc1R, closing the circle.

GPCRs constitute one of the biggest classes of targets for existing drugs, so one of the first tasks of anyone during the genome gold rush was to identify every GPCR they could. However, it is very difficult to advance a GPCR if it lacks a known ligand ("orphan receptor"), so drug discovery groups spent a lot of effort attempting to 'de-orphan' the GPCRs flowing from the genome project -- and very few had much luck. I haven't kept close tabs on the field for a few years, but it would seem there are still a lot of orphans left. Plus, from a physiological standpoint you don't just want to know 'a' ligand for a receptor but the full complement. This work is a reminder that new GPCR discoveries can come from a largely unanticipated angle.

It's been a huge year for dog genetics, and I've touched on a few items in this space. I suspect that someone really in tune to the field could easily fill a blog with it; I just catch the things in the front-line journals and the occasional stray from a literature or Google search. Much of the work this year has been on morphology, and there's still plenty to do. Many dog breeds have common abnormalities and those are beginning to be unraveled as well -- and many will likely have relevance to human traits. One I stumbled on recently is the identification of a deletion responsible for a common eye defect in collies.

The really big fireworks will come when behavior genetics studies really fire up in dog. Some traits have been deliberately bred into particular breeds (think herding & hunting dogs) and others inadvertently (such as anxiety syndromes). Temperament varies by breed, and of course just about any dog is more docile than their wild lupine relatives. There will be lots of interesting science -- and probably more than a few findings that will be badly reported and misinterpreted in the popular press. Let's hope, for his sake, that James Watson keeps his mouth shut about any of it.

BTW, Lupine? -- another telegraph character name. Fluffy, on the other hand, not quite the name you'd expect on a gigantic three-headed dog. Alas, there's only one Fluffy mentioned, so it might not be possible to map the genes responsible for that!

Thursday, October 18, 2007

Chlamydomonas swims across the line

Last week's Science contained the publication of the Chlamydomonas reinhardtii genome, an old friend of mine from my undergraduate days. One thing I find particularly illuminating is how the focus of Chlamydomonas research has shifted.

Chlamydomonas has been studied for a long time, and was the system where the uniparental genetics of organelles was discovered. Chlamy has two flagella, and a lot of genetics on flagellar function had been performed in the system. But, in general it was viewed as a convenient model system for studying photosynthesis and nutrient uptake. If I remember reasonably well, in the late '80s it was probably 75:25 plant physiology:flagellar function in the literature, and the flagellar work was viewed as basic cell biology. Most publications were either in basic cell biology journals or plant journals, with the most notable paper in a flashy journal being the report of a separate basal body genome -- a finding which has not withstood the test of time.

Around the time I was graduating, it looked like interest in Chlamy might fade. Genetic transformation had finally been developed, but a new model plant had shown up: Arabidopsis. It had many of the desirable characteristics of Chlamy (such as packing a lot into a small space), but the molecular genetic tools were being developed amazingly rapidly & as a land plant (and relative of some of kids' least favorite vegetables) it appeared more desirable.

Chlamy's two flagella make it unusual, as land plants and fungi lack flagella. So the genome paper, and some earlier papers, really pounce on this. Flagella have gone from just being interesting cellular structures to interesting cellular structures with a lot of human disease interest. By performing various taxonomic comparisons, genes can be identified as present in all flagellum-bearing species but no non-flagellated ones, or as conserved in photosynthetic eukaryotes but universally absent from non-photosynthetic ones. Lots of good stuff there.
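The taxonomic comparison amounts to a presence/absence filter over species sets. Here is a minimal sketch of the idea. The gene names and the presence table are made-up placeholders, not data from the genome paper, and the species lists are just a handful of familiar examples.

```python
# Species sets chosen for illustration only.
FLAGELLATED = {"Chlamydomonas", "human", "Trypanosoma"}
NON_FLAGELLATED = {"Arabidopsis", "yeast", "Neurospora"}

# Hypothetical table: gene family -> species in which a homolog is found.
presence = {
    "geneA": {"Chlamydomonas", "human", "Trypanosoma"},
    "geneB": {"Chlamydomonas", "human", "Trypanosoma", "yeast"},
    "geneC": {"Chlamydomonas", "Arabidopsis"},
    "geneD": {"Chlamydomonas", "human"},
}


def matches_profile(species_with_gene, required, forbidden):
    """Present in every required species and in none of the forbidden ones."""
    return required <= species_with_gene and not (forbidden & species_with_gene)


flagellar_candidates = sorted(
    gene
    for gene, species in presence.items()
    if matches_profile(species, FLAGELLATED, NON_FLAGELLATED)
)
print(flagellar_candidates)
```

Swapping in sets of photosynthetic and non-photosynthetic species gives the second comparison. Real analyses have to cope with missed homologs and incomplete genomes, so an all-or-nothing rule like this one is stricter than anything one would use in practice.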

What next for the plant that swims? Googling & PubMed reveal interest in biofuels & bioremediation. Chlamydomonas is hot -- and going to stay that way.

Wednesday, October 17, 2007

When Personal Genomics is Very Personal

Anyone interested in personal genomics should hunt down the new Nature (available online at the moment) and read the story of Hugh Rienhoff, whose third child (a daughter) was born with a still mysterious set of symptoms. Since her birth he has been bouncing around trying to get a diagnosis for her condition which resembles Marfan's and a similar disorder called Loeys–Dietz.

Rienhoff was trained as a physician under Victor McKusick and helped start a genomics firm (DNA Sciences), so he was a bit primed for this. Remarkably, he has apparently set up his own PCR laboratory in his house so he can perform targeted sequencing of candidate genes from his daughter's DNA -- using an unnamed contract research house. Alas, none of these searches have yet turned anything up.

Because of the similarity of his daughter's symptoms to the other two syndromes & because both of these syndromes involve TGF-beta signalling, as well as the well characterized role of TGF-beta signalling in muscle development & his daughter's muscular problems, Rienhoff & her doctor recently decided to put the child on a high blood pressure medication which is suggested to reduce TGF-beta signalling and to help in a mouse Marfan's model.

The story is a good illustration of the promise -- and the complications -- of cheap DNA sequencing to identify the causes of rare diseases. Small scale targeted sequencing hasn't worked out -- but given the large number of genes known to be involved in TGF-beta signalling the odds were never wonderful. Perhaps a full genome scan, or targeted resequencing using one of the new array-based capture schemes, might find a strong candidate mutation -- some of the other TGF-beta related syndromes are dominants, so perhaps this will be too & comparing the daughter's scan to the parents will single out the mutation. But, the results might be inconclusive -- no strong candidates. Or, perhaps a candidate is found because it is a de-novo mutation in the child & is likely to have a major effect (non-synonymous substitution, truncation mutant, etc), but in an utterly unstudied gene. At least that's something to go on, but not much.
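Comparing the daughter's scan to the parents' is, at its simplest, a set difference. The sketch below uses invented variant calls keyed by (chromosome, position, alternate allele); none of the coordinates or alleles are real data, and the approach ignores sequencing error, which would dominate a real candidate list.

```python
# Hypothetical variant calls for a parent-child trio.
child = {("chr15", 1000, "T"), ("chr3", 2000, "G"), ("chr9", 3000, "A")}
mother = {("chr15", 1000, "T"), ("chr7", 500, "C")}
father = {("chr9", 3000, "A")}


def de_novo_candidates(child_calls, mother_calls, father_calls):
    """Variants seen in the child but in neither parent."""
    return child_calls - mother_calls - father_calls


candidates = de_novo_candidates(child, mother, father)
print(sorted(candidates))
```

A next step would be to rank what survives by predicted effect (non-synonymous, truncating and so on), which is where the "utterly unstudied gene" problem comes in.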

The article touches on how patients with unusual clusters of symptoms often get lumped into 'dustbin' categories, syndromes whose common thread is an inability to assign the patients to another category. Personal genomics may be quite useful for cutting down on such diagnoses, as the genetic data may sometimes provide the compass to guide through the morass of symptoms. On the other hand, there will probably be whole new bins of genetic syndromes -- 'polymorphism in X with skeletal defects' -- again, it is something to go on, but they are almost guaranteed to pile up much faster than the experiments to sort them out can be run.

After reading the article, I can't help but hope that his daughter gets into one of the big sequencing programs, such as the recently announced Venter center 10K genome effort. There will be a lot to be gained by finding out the ordinary variation which makes each one of us different, but there should also be a bunch of slots reserved for patients for whom sequence results might, if they are lucky, give them some new options in life.

Tuesday, October 16, 2007

Innumeracy at the highest levels

I admire Richard Branson for his many entrepreneurial and adventuring efforts. I am especially wishing for the success of his spaceflight venture -- when Millennium changed travel companies a few years back I put Virgin Galactic at the top of my carrier preference list. Maybe I can arrange a business trip in the future.

But it is clear that Branson isn't the one doing the engineering math -- or let's hope so. I happened to scan a Boston Herald at a restaurant tonight -- I'm no fan of the Herald, but I'm a compulsive enough reader I'll skim it if it's free -- and saw that Branson had spoken before a business group in Boston. He is quoted as saying
You’ll go from (zero) to 4,000 miles an hour in 10 seconds - which will be quite a ride

Presumably Branson's gotten caught up in the thrill of the flight idea, but that's just ludicrous -- not that the Herald caught it. I haven't done such calculations since college physics, but with a little Excel help & my three best-remembered Imperial conversion factors (5280 ft/mile, 12 inches/foot & 25.4 mm/inch) and checking my memory of g in Wikipedia (remarkably, I remembered it!), the miles/hr -> inches/hr -> mm/hr -> m/s series puts that at 178.8 m/s^2, or about 18g sustained for ten seconds! According to Wikipedia, the highest known G-force to be survived was 180+g in a race car accident, but that was a momentary jolt. Amusement park rides don't even pull 10g (according to the same entry). Given the flight profile of SpaceShipOne, which is the basic technology platform for Virgin Galactic, a more realistic profile of 500 miles per hour -> 4000 in two minutes would be a far more plausible 1.3g.
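The unit chain is easy to re-run outside Excel. A short sketch using the same three conversion factors, plus the standard value of g, and assuming constant acceleration (the function names are mine). One thing to watch: 178.8 is what comes out in metres per second squared, and it still has to be divided by g to be expressed in g's.

```python
# Conversion factors from the text: 5280 ft/mile, 12 in/ft, 25.4 mm/in (all exact).
MM_PER_MILE = 5280 * 12 * 25.4
STANDARD_GRAVITY = 9.80665  # m/s^2, the standard value of g


def mph_to_m_per_s(mph):
    """Convert miles per hour to metres per second."""
    return mph * MM_PER_MILE / 1000 / 3600


def acceleration(v_start_mph, v_end_mph, seconds):
    """Return (m/s^2, multiples of g), assuming constant acceleration."""
    a = (mph_to_m_per_s(v_end_mph) - mph_to_m_per_s(v_start_mph)) / seconds
    return a, a / STANDARD_GRAVITY


quoted = acceleration(0, 4000, 10)      # the figure as quoted in the Herald
gentler = acceleration(500, 4000, 120)  # the two-minute profile

print(f"0 -> 4000 mph in 10 s:    {quoted[0]:.1f} m/s^2 = {quoted[1]:.1f} g")
print(f"500 -> 4000 mph in 120 s: {gentler[0]:.1f} m/s^2 = {gentler[1]:.1f} g")
```

The first line works out to roughly 18 g and the second to a bit over 1 g.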

Such innumeracy is frequently present in media articles in one way or another. Given this poor foundation, how will we ever equip patients to intelligently use genomic profile information? Surely there will be many good, trained persons stepping into that void, and just as surely there will be plenty of hucksters and worse.

Monday, October 15, 2007

Nobel Silly Season

For a number of years now I have thought of early October as Nobel season. With the prizes often come two rounds of silliness.

The fun silliness is the Ig Nobel prizes. They are very silly, and the humor is often juvenile, but they are also fun, poking fun at research on the fringe in one way or another. I've attended one ceremony and it is worth doing once (more if you enjoy it the first time).

The ridiculous silliness involves various media reports treating the geography of science Nobel prize awards as some sort of barometer of the state of science in those regions. A year or so ago Nature was moaning over the lack of European laureates. I can't find a link, but this year the talk was about the lack of American science Nobels (no, Al's Peace Award doesn't count as science!) and the dominance of Europeans. This was particularly absurd since 2 of the 3 physiology awardees did their work at American universities! Here is what appears to be Smithies' first mouse knockout paper, and the institution listed is U Wisconsin. Capecchi's came from U Utah.

But even if all the Nobels went to researchers at Lilliput, that would be useless for judging the state of science anywhere. Nobels generally go for work done many years before -- so if they say anything, it would be about the state of science 1-2 decades ago -- and they are hardly useful for that. The Nobel prizes are great opportunities to learn about top notch research, but they are just an idiosyncratic sampler, not a representative sample.

Friday, October 12, 2007

National Wildlife Genomics

Visitors to our house are likely to quickly notice a recurrent theme in the decor, starting with a garden ornament and continuing throughout the house. Pictures, books, dog toys -- even a trash can, with a common two-color scheme. Or, for those who think that way, two non-colors. An inspection of The Next Generation's quarters will reveal the mother lode: melanoleuca run amok. The house bears a bi-color motif: a motif of bi-color bears. Yes, we pander to pandas!

It was therefore with interest that I saw (thanks to GenomeWeb!) an item from Reuters that the Chinese government is funding a project to sequence the panda genome prior to the 2008 Beijing Olympiad. Wild pandas are found only in China and are considered a national symbol & treasure.

A panda genome should be of great interest to evolutionary biologists, as the panda is a bit of an odd bear. Indeed, until the arrival of molecular systematics its affinity to bears was unclear, with alternate groupings putting pandas on their own or with raccoons along with red pandas (which are not bears). With the lag time in populating libraries and such, the doubt about their taxonomy persists in many schools and many minds: TNG has already been tutored to defend the ursinity of Ailuropoda with the DNA argument. Pandas have adopted a nearly vegetarian lifestyle, consuming mostly bamboo -- and their digestive tracts probably haven't quite caught up to that change. Anatomical variations, such as the famous panda's "thumb", might also have detectable traces in the genome. Perhaps even some genetic drivers of their extreme cuteness will be identified!

However, if you were picking a bear to sequence for physiological insight, I'm not sure you'd pick pandas, as they don't hibernate, and hibernation is surely a fascinating topic. All those metabolic changes must leave an imprint on the regulatory circuits.

There is a clear solution to that. China is hardly the first country to sequence wildlife genomes identified with that country: the Aussies have been hopping through the kangaroo genome. So perhaps the Canadians could go after the polar bear genome so the world can have a good hibernating bear to compare with the non-hibernating panda.

What other genomes might be sequenced as a matter of national pride? Are the New Zealanders launching a kiwi genome project? An Indian tiger (or king cobra) project? A Japanese crane sequence? One almost yearns for the lost central European monarchies, as then we would find out the genes responsible for a double-headed eagle.

Thursday, October 11, 2007

Opus #173, Programming on the Dark Side (C#)

I had commented a while back that I was contemplating shifting my programming focus from Perl to another language. The existing code base is split between C# and Python, with more C# but with a lot of code I need to think about in Python. I gave both a bit of a trial and also took some suggestions, and did come to a decision.

Hands down, C# is my language.

Now, language choice is a personal matter, and I don't dislike Python -- at some point I'll write down more impressions -- but C# is a great match. I really do like a strongly typed language, both because it catches lots of silly mistakes at compile time rather than runtime and because the typing provides lots of cookie crumbs for trying to reason out someone else's code (or old code of your own). That could also make for a long separate post.

There are really three powerful things to like about C#. First, the language itself. While I'm far from having figured everything out, for the most part I can't argue with it. Lots of powerful concepts and a general feeling of consistency (as opposed, for example, to Perl's kitchen sink collection of stuff).

Second, there are the .NET class libraries. There is an awful lot there to cover many things you'd want to do, and again there is a reasonably strong sense of consistent design. Here I might find more to quibble over, but it generally hangs together.

Third, there is Visual Studio, a very slick integrated development environment (IDE). The help facility is very powerful for exploring the language, the error messages are generally good, and the ability to browse data in a running program is superb. Furthermore, you can perform a remarkable degree of editing on a running program -- there are many things not allowed, but a lot of runtime errors can simply be edited away and the program continued from where the exception occurred.

However, there is one key drawback to C# from a bioinformatics standpoint: you are not going with the crowd. There appear to have been at least two efforts to create bioinformatics libraries for C#, and both appear to have been stillborn. If you Google for "C# bioinformatics" or ".NET bioinformatics" you find stuff, but more idle talk than solid work. And I think there is an obvious reason for that.

All three of the legs are controlled, or at least perceived to be controlled, by the Emperor Gates. If you do click around some of the Google links it's not hard to find disdainful comments about the perceived Microsoftity or Windowsosity of C#/.NET. There is an effort called MONO to port the whole slew over to UNIX boxes, but it's not clear this is perceived as more than a fig leaf. The name certainly isn't going to win friends among undergraduates -- "Have you gotten MONO yet?".

On the other hand, there is definitely corporate interest. Microsoft has been making increasing noises about bioinformatics, though perhaps focused further downstream than where I usually work. Spotfire, which is really useful for data exploration, I've heard provides a .NET API. Certainly during my interviews last year I saw C# books or heard mention of it at many of the companies.

So, it's a locally packed but globally lonely world to be a C# bioinformaticist. Luckily, it wasn't hard to build the critical tools I needed -- but I needed only a modest subset of what BioPerl, BioPython or BioJava would provide. However, there are some interesting ways to leverage those tool sets -- though that will have to be another subject for another time.

Yet Another Far Out Sequencing Idea?

GenomeWeb carries the news that another little-known company, this time English, has thrown its hat into the Archon X-Prize ring.

Base4 Innovation has a website, but it's pretty sparse on details. A lot of cool buzzwords -- nanotechnology, single-photon imaging, direct readout of DNA -- but not much more to go on. $500/genome in hours is the target throughput (no mention of error bars on those estimates!).

One of the interesting things to observe as the genome sequencing field heats up is how many non-traditional entrants are being attracted. When the genome sequencing X-Prize was first announced, one of my immediate ponderings was to what degree the entrants would simply be the familiar names in genome sequencing, and which would be out of left field. If I had to place wagers, I would put the outsiders as longshots -- but that's very different than writing them off.

The first X-Prize was personally very exciting, as it would appear to offer a route to realization of a permanent dream -- and I don't have $20M lying around for a trip to the ISS (sizable donations towards that goal, however, will not be refused!). For less than the price of a decent house in the Boston area one will soon be able to get a short trip to sub-orbital space (isn't that what home equity lines were invented for?).

The original X-Prize, though, had a very straightforward goal -- two flights to a certain altitude in a certain timeframe with requirements as to how much of the vehicle was reused (okay, perhaps not so simple to state). The genome sequencing prize has (IMHO) far more ambitious goals, which are also harder to define -- after all, the space prize went for replicating a 40-year-old feat with private money, whereas the genome sequencing prize will demand going far ahead of current capabilities in the areas of cost and speed.

The space X-Prize was won hands-down by one competitor, with nobody else anywhere close. Well, one competitor claimed to the last minute they were close, but it started to smell suspiciously like a publicity stunt for their main sponsor, an utterly shameless internet venture (in applied probability) which also paid streakers to run through the Torino Olympic ceremonies. Will the genome sequencing race also have a runaway entrant, or will it be a photo finish? Stay tuned.

Tuesday, October 09, 2007

This Old Genome

I recently stumbled across a paper proposing a set of mammalian genomes for sequencing to further aging research. A free version of the proposal can be found via this site. I had previously posted some ponderings about what the most interesting unsequenced genomes are, and this would be one focused take on that question.

Despite the fact that it is clearly a process I will be familiar with, I'll confess a lot of ignorance about aging. The paper lays out a good rationale for the mammals it chooses (though with a mammalian focus, misses the opportunity to sequence the tortoise genome!).

This paper is also worth noting as something we will not see many more of. Not because there aren't plenty of interesting genomes to sequence, but because it won't be worth writing a paper about your plans to do so. Once genome sequencing becomes very cheap, a proposal to sequence a mammalian genome will become just a paragraph in a grant proposal at most, or more likely something mentioned only after the fact in an annual grant report. Certainly in the world of small genomes, such as bacteria, the trouble will be getting the samples to sequence, not the cost of sequencing.

On the other hand, even with really cheap genome sequencing, it will be a long time before all species are done -- even if some scientist has an inordinate fondness for beetle genomes!

Tuesday, October 02, 2007

Interference Interference

A recent publication in Nucleic Acids Research (a very fine journal which is now all open-access) highlights an underappreciated (IMHO) aspect of RNA interference, or RNAi, studies of gene function, and may also be relevant to the therapeutic application of RNAi.

RNAi is another one of those amazing bits of biology (with restriction enzymes another obvious example) which seem too good to be true: short, computationally designed bits of nucleotide sequence can specifically knock down the expression of targeted genes. Much of the work in the field has been on attempting to identify and control so-called off-target effects, as the specificity is not perfect. In the worst case, all of your novel hits may turn out to simply be off-target effects back to a not-at-all novel gene for the function of interest.

One strategy widely employed to reduce off-target effects is to use pools of siRNAs, with the general thought that if the on-target effects are additive and each siRNA has its own idiosyncratic list of off-targets, then the off-target effects will be diluted but the on-target ones amplified. There is more than just hope to support this, but a possible problem emerges: can the individual siRNAs interfere with each other? In particular, since siRNA design isn't quite perfect, could one bad siRNA in the pool clobber the effects of the others?
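To make the pooling arithmetic concrete, here is a back-of-the-envelope sketch in Python. It is purely illustrative and not taken from the paper: the function name `pooled_effects`, the equal split of the dose, and the linear dose-response are all my own simplifying assumptions.

```python
def pooled_effects(n_sirnas, total_dose=1.0, on_per_dose=1.0, off_per_dose=1.0):
    """Toy linear model of an siRNA pool (hypothetical, for illustration only).

    Assumes the total dose is split equally among the siRNAs, that effects
    scale linearly with dose, that on-target effects add up, and that no two
    siRNAs in the pool share an off-target gene.
    """
    dose_each = total_dose / n_sirnas
    # Every siRNA in the pool hits the intended gene, so contributions sum.
    on_target = n_sirnas * dose_each * on_per_dose
    # Each off-target gene is hit by only one siRNA, at its reduced dose.
    worst_off_target = dose_each * off_per_dose
    return on_target, worst_off_target


if __name__ == "__main__":
    for n in (1, 2, 4):
        on, off = pooled_effects(n)
        print(f"pool of {n}: on-target {on:.2f}, worst off-target {off:.2f}")
```

Under these assumptions the on-target effect stays at the full-dose level while each off-target effect shrinks as 1/n. Real dose-response curves are not linear, so treat this as the intuition only.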

One way such an effect could be realized is if all the siRNAs are competing for a limited resource. siRNA does not work by magic, but rather by utilizing built-in cellular machinery. If the excess capacity of that machinery, above the load already placed by normal cellular processes, is soaked up by the applied siRNAs, then interference between siRNAs could result.
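The limited-resource argument can also be sketched as a toy model. Again, this is my own hypothetical illustration, not the model used in the paper: `risc_loading`, the notion of "demand" (concentration times some relative loading preference) and the proportional split are all assumptions.

```python
def risc_loading(demand, capacity):
    """Share a fixed pool of spare machinery among competing siRNAs.

    demand:   dict mapping siRNA name -> demand in arbitrary units
              (hypothetically, concentration times loading preference)
    capacity: spare machinery available, in the same arbitrary units

    If total demand fits within capacity, every siRNA is fully loaded.
    Otherwise capacity is divided in proportion to demand.
    """
    total = sum(demand.values())
    if total <= capacity:
        return dict(demand)
    scale = capacity / total
    return {name: d * scale for name, d in demand.items()}


if __name__ == "__main__":
    good_only = {"good_A": 1.0, "good_B": 1.0}
    with_bad = {"good_A": 1.0, "good_B": 1.0, "bad": 8.0}
    print(risc_loading(good_only, capacity=5.0))  # no competition
    print(risc_loading(with_bad, capacity=5.0))   # greedy siRNA crowds the rest
```

In this sketch, adding one greedy siRNA pushes total demand past capacity and halves the loading of the two good ones. Raising the capacity to cover the total demand makes the interference vanish.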

One key result in the new paper is that the levels of RISC, the key RNAi-executing complex, vary across cell lines. Biology tends to be synonymous with variability, but this isn't always accounted for in experimental designs. This may translate into experiments behaving very differently by cell line, and given the somewhat shadowy understanding of cell lines, this is not great news.

The paper goes on to identify Ago2 as the key protein whose levels affect siRNA competition. By tinkering with Ago2 expression, either up or down, the interference effects can also be modulated.

As the authors summarize, this all stresses the need for being cautious in designing & interpreting RNAi experiments and in extrapolating results in one cell line to others. At my previous posting I looked at a lot of RNAi papers, and as in the days of microarrays there was a worrisomely low degree of overlap in the hits between ostensibly equivalent screens. In one case, two papers claiming to use the same cell line came up with incompatible phenotypes for one particular gene knockdown. Measuring Ago2 levels is a control which should be strongly considered for these experiments, and results from pooled siRNA experiments without deconvolution into individual siRNAs aren't to be trusted (I'm not sure I've seen such published, but I'm sure people are tempted). RNAi is a powerful means to functional analysis & potentially a useful therapeutic modality, but it's not quite as clean & simple as one might dream of.

Monday, October 01, 2007

Spot's Ridges (& Ridge's Spots?)

Miss Amanda is quite excited about two new papers on the Nature Genetics preprint site, though as we don't have a subscription we're stuck reading just the abstracts and the supplementary material. The papers use genetic mapping for fine-scale mapping of the variations responsible for two visible phenotypes: the distinctive back ridge in ridgeback dogs and a coat spotting phenotype found in many breeds.

A particularly striking claim in the one abstract is that this mapping could be accomplished with approximately 20 individuals. This is quite a small number, and would suggest that many Mendelian traits in dogs will be rapidly mapped given the modest (by genomics standards) cost of doing an experiment (arrays are already below the $1K/sample mark). I've promised the little miss we can go halfsies on any papers on floppy ears, curled tails or flat faces.

The ridgeback variant is also interesting because it is a copy number variation, a very hot class of genetic variations lately. The duplicated region contains three FGF family members, growth factors known to play roles in development. Of further interest is that the polymorphism also tracks with a nasal abnormality seen in these dogs. Many pure breeds suffer from distinct maladies which are often direct results of the physical shape of the canine. For example, short snouts raise the risk of eye injury, which is a trauma M.A. suffered soon after arriving at our abode. However, in this case it would appear that the phenotypes have an underlying biological explanation that is not simply the shape but a common developmental trigger.

This is the time of year for agricultural fairs & I was recently (as usual, biogeek that I am) strolling through one marveling at the range of breeds of various animals. Chickens are perhaps the showiest at these affairs, but there are also lots of varieties of goats, sheep, cows, horses, ducks, rabbits, cavies, etc. Most of these species have draft genomes in one form or another, and with the cost of sequencing sliding down surely all will have one before long. Sequencing a sample of individuals will enable mapping assays to be developed, which is becoming routine. Before long, many of those phenotypic variants, both showy and practical, will be mapped and identified. Other species with many identified breeds, such as cats, goldfish or Darwin's pigeons, will become straightforward to analyze as well.

Dogs do offer the most spectacular gains. This is not pure boosterism, but a reflection that dogs seem to have been selected by humans for such a wide variety of traits: shape, color and particularly behavior. I love cats too, but there just aren't any herding breeds!

Dog genetics is also an early example of direct-to-consumer genetic scanning -- one can check up on the breed heritage of a dog. There is a dog up the street which was marketed as a purebred Shih Tzu, but the face is radically different from my companion's. Nothing wrong with that, and it was probably just a bit of confusion at the breeder, though Amanda thinks it is more an example of Svejk-style skulduggery (I should never have read that stuff to her!).