Wednesday, July 01, 2009

Gene Expression from A-Z

I was playing with the data from an early RNA-Seq paper just to have a general idea of what such data looks like and to check out some favorite genes. It was also an exercise in learning the latest Spotfire -- I had Spotfire back at MLNM but it's been over 2 years and a completely new interface was rolled out.

An easy way to find favorite genes and compare them across the three tissues (brain, liver, muscle) is to set up a trellis plot with expression as the y-axis and the gene name as the x-axis, and then use the filtering tools to find my genes. Of course, it's hard to avoid looking at the overall plot -- and picking out some fortuitous patterns.

What immediately jumps out are the three semi-blank vertical zones (you can spot a fourth, very thin one convincingly in the original; it's only vaguely there in the PNG shown here). What are these? Take a guess before reading below.


The big one consists entirely of genes starting with "Olf" -- the olfactory receptors. This is a large subfamily of type I G-protein coupled receptors (GPCRs) whose discovery netted a Nobel Prize. In general, these are expressed solely in the olfactory epithelium, but a little more on that later.

The thin line to the left of it has genes starting with Mirn -- microRNAs, which this particular sequencing effort wasn't very tuned for. The next one to the left has genes starting with Ig -- immunoglobulin genes. Since B-cells are not one of the samples, low expression there is no shocker. In the very thin line to the right of the Olf cluster (which you might not see), the genes all start with Vr1 -- the vomeronasal receptors, another bit of specialized GPCRs involved in pheromone recognition.

Of course, especially having an interactive display, you can find other patterns. A block of genes starting with Mrp have very similar, high expressions in all three tissues -- the mitochondrial ribosomal proteins. A clump enriched for names starting with Psm shows a similar pattern -- the proteasome subunits.

I don't recommend spending a lot of time doing this analysis -- the visual cortex is too good at picking up patterns & clearly gene names were not picked to make this a great way to find biology. But it is mildly fascinating.

One further note. While the Olf cluster has a lot of low expression, it isn't devoid of expression (below; ignore the sides as I'm still learning quite how to get the boundaries set precisely in SF). Furthermore, some of the same genes are seen in all three samples. Now, this could be erroneous due to improper fragment mapping or some other transcriptionally active gene that overlaps these, but I think we should also be open to the idea that some of the olfactory receptors may have been co-opted for other purposes. After all, if there is a battery of diverse proteins with a spectacular range and sensitivity for different compounds, why wouldn't some be used for something other than exploring the environment?

Monday, June 29, 2009

Lox: The last genome for electrophoretic Sanger?

Amongst the news last week is a bit of a surprise: the salmon genome project is choosing Sanger sequencing for the first phase of the project. Alas, one needs a premium subscription to In Sequence, which I lack, so I can't read the full article. But, the group has published (open access) a pilot study on some BACs, which concluded that 454 sequencing couldn't resolve a bunch of the sequence, and so shorter read technologies are presumably ruled out as well. A goal of the project is a high quality reference sequence to serve as a benchmark for related fish, demanding very high quality.

This announcement is a jolt for anyone who has concluded that Sanger has been largely put to pasture, confined to niches such as verifying clones and low-throughput projects. Despite the gaudy throughput of the next-gen sequencers, read length remains a problem. However, that hasn't stopped de novo assembly projects such as panda from apparently proceeding forward. Apparently salmon is even nastier when it comes to repeats.

Still playing the armchair next-gen sequencer (for the moment!), it is an interesting gedanken experiment. Suppose you had a rough genome you really, really wanted to sequence and get a high-quality reference sequence. On the one hand, Sanger sequencing is very well proven. However, it is also more expensive per base than the newer technologies. Furthermore, Sanger is pretty much a mature technology, with little investment in further improvement. This is in contrast to next gen platforms, which are being pushed harder and harder both by the manufacturers as well as the more adventurous users. This includes novel sequencing protocols to address difficult DNA, such as the recently published Long March technique (which I'm still fully wrapping my head around) that generates nested libraries for next-gen sequencing using a serial Type IIS digestion scheme. Complete Genomics has some trick for inserting multiple priming sites per circular DNA template. Plus, Pacific Biosciences has demonstrated really long reads in a next gen platform -- but demonstrating is different than having it in production.

So it boils down to the key question: do you spend your resources on the tried-and-true but potentially pricey approach, or bet that emerging techniques and technologies can deliver the goods soon enough? Put another way, how critical is a high quality reference sequence? Perhaps it would be better to generate very piecemeal drafts of multiple species now and then go for finishing the genomes when the new technologies come on line. But what experiments dependent on that high quality reference would be put off a few years? And what if the new technologies don't deliver, in which case you must fall back on Sanger and be quite a bit behind schedule?

It's not an easy call. Will salmon be the last Sanger genome? It all depends on whether the new approaches and platforms can really deliver -- and someone is daring enough to try them on a really challenging genome.

Sunday, June 21, 2009

Cancer Genome Sequencing--A (Pessimistic) Interim Analysis

The current issue of Cancer Research carries a very brief (3 pages, with one page mostly tables & figures) review of the first pulse of cancer genome sequencing papers (sub required to read article). While sub-titled 'An Interim Analysis', perhaps a better subtitle would be 'A Uniformly Negative Analysis'.

A full-press cancer genomics project has been a controversial drive, with many bemoaning the huge amount of resources devoted to it and believing other avenues would be better suited for enhancing our ability to help cancer patients. But it has gone forward, and a spate of papers over the last year have reported the early results.

The initial papers have covered 4 of the big cancers in terms of incidence and mortality (lung, breast, colorectal and pancreatic) as well as glioblastoma and leukemia. Different studies have taken different tacks. In leukemia, we have the first parallel complete sequencing of a patient and their tumor. Papers in breast, colorectal (together covered in two papers here and here), pancreatic and glioblastoma looked at huge numbers of coding exons in small numbers of patients (11 patients x 18.2Kgenes for breast and colorectal; 21 patients x 20.6Kgenes for glioblastoma; 24 patients x 20.6Kgenes for pancreatic). A lung paper and the other glioblastoma paper looked at ~600 genes, but in larger numbers of patients (188 in lung and 91 in glioblastoma).

Personally, I would take a more nuanced view of the results. I think it is hard to argue that these papers have had a shortage of fireworks; there have been some important observations made, which curiously the Cancer Research review ignores completely. In the lung study (which I have studied the closest) these include important exclusion and cooperativity relationships between mutations and a number of novel, druggable candidate driver genes (protein kinases) not previously suspected in lung cancer. In the many-genes, few-patients glioblastoma study, it was the identification of a mutational hotspot in isocitrate dehydrogenase 1 (later found to be present, though less frequently mutated, in isocitrate dehydrogenase 2).

Of course, one thing which is changing rapidly is the cost of doing these studies. Most of these papers used conventional PCR amplification and Sanger sequencing, which I would lowball estimate at $1/well (very lowball, but Sandra Porter caught some serious flak suggesting [as I have] a number much higher than this for the sequencing part, and I don't have the accounting experience to argue -- but I do know people who calculated it at Codon and this would be a very low estimate) -- so those studies looking at nearly every coding exon were at least a quarter million per patient (those 20+K genes explode out to about a quarter million exons). Clearly this isn't how things will tend to be done going forward; Illumina will now blow away genomes for $48K each and other companies are now quoting even lower. This is still well in excess of the per patient estimate for the very focused studies, and I believe these (particularly the lung study) demonstrate the value of lots of patients, since this started to give the numbers required to look at interactions between mutations.
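As a rough check on that arithmetic, here is a minimal sketch in Python. The dollar figures and counts are the ones quoted above; the one-well-per-exon assumption, and the idea that the ~600-gene studies have the same average exons per gene as the near-complete ones, are simplifications introduced purely for illustration.

```python
# Back-of-envelope per-patient costs, using the figures quoted in the post.
COST_PER_WELL_USD = 1.0       # the deliberately lowball $1/well estimate
EXONS_ALL_GENES = 250_000     # "about a quarter million exons"
GENES_ALL = 20_600            # genes covered in the near-complete studies
GENES_FOCUSED = 600           # genes in the focused lung/glioblastoma studies
GENOME_PRICE_USD = 48_000     # quoted whole-genome price


def sanger_cost(n_exons, cost_per_well=COST_PER_WELL_USD, wells_per_exon=1):
    """Cost of PCR + Sanger for n_exons (assumption: one well per exon)."""
    return n_exons * wells_per_exon * cost_per_well


# Assumption: the focused gene sets have the same average exons per gene.
exons_focused = EXONS_ALL_GENES / GENES_ALL * GENES_FOCUSED

all_exon_cost = sanger_cost(EXONS_ALL_GENES)
focused_cost = sanger_cost(exons_focused)

print(f"All coding exons: ${all_exon_cost:,.0f} per patient")
print(f"~600 genes:       ${focused_cost:,.0f} per patient")
print(f"Whole genome:     ${GENOME_PRICE_USD:,} per patient")
```

Under these assumptions the focused studies come out well under $10K per patient, which matches the ordering described above: focused Sanger study, then whole genome, then every coding exon by Sanger.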

One of the reasons the Cancer Research authors aren't terribly pleased with the progress is clear: they feel the experiments aren't the correct ones. But whereas some of the flak I had seen directed at the cancer genome sequence concept was instead promoting more functional approaches (such as RNAi library screening), what these authors want (or at least set as the minimum bar for interesting) is cancer genome screening on an almost monomaniacal scale: thousands if not millions of individual cells from the same tumor! Clearly this would be fascinating, as there is plenty of evidence that tumors are a motley collection of genetically variant cells (but clonal -- all the tumor cells have the same ancestor, but they also are all sloppy DNA copyists). And, as they note, no DNA sequencing technology here now or on the immediate horizon has any shot at a project of this scale.

While I do believe this would be interesting, I'm not as certain it would be informative for patient care. Since many of these mutations are under very little selection, the spectrum of observed mutations is likely to be enormous. Moreover, there is already a horrendous backlog in characterizing the mutations seen in the studies to date (though there has been a paper already functionally characterizing the isocitrate dehydrogenase mutations).

What is particularly strange about this view is that a more reasonable intermediate step would be to look at those cells that do escape the primary tumor (most of the cancer genome papers so far have focused on primary tumors, though the IDH mutations are primarily found in secondary glioblastomas) -- sequence the metastases. Ideally, this would mean finding multiple patients willing to consent to their genome, their primary's genome, and multiple metastases' genomes being sequenced -- the latter quite likely coming from autopsies (otherwise it is a lot of painful biopsying without much hope of helping the patient, an ethically questionable activity). Or, in leukemias one could more easily resequence after each relapse. Such studies would be technically doable and not ridiculously expensive (though clearly not chump change either).

There's also the open question as to whether the real fireworks will come from sequencing less studied cancers, such as the recent success in using transcriptome sequencing to identify the probable causative mutation in a rare type of ovarian cancer (see also the News and Views piece). Perhaps we've mined the rich ore out of some of these veins, and it is the less worked seams which will yield fine genomic insights. Time will tell.

Sunday, May 31, 2009

Teasing small insertion/deletion events from next-gen data

My interest in next-generation sequencing is well on the way to shifting from hobby to work-central, which is exciting. So I'm now really paying attention to the literature on the subject.

One of the interesting uses for next-generation sequencing is identifying insertion or deletion alleles (indels) in genomes, particularly the human genome. Of course, the best way to do this is to do a lot of sequencing, compare the sequence reads against a reference genome, and identify specific insertions or deletions in the reads. However, this is generally going to require a full genome run & a certain amount of luck, especially in a diploid organism as you might not sample both alleles enough to see a heterozygous indel. A cancer genome might be even worse: these often have many more than two copies of the DNA at a given position and potentially there could be more than two different versions. In any case, full genome runs are in the ballpark of $50K, so if you really want to look at a lot of genomes a more efficient strategy is needed.

The most common approach is to sequence both ends of a DNA molecule and then compare the predicted distance between those ends with the distance on the reference genome. If you know the distribution of lengths that the sequence library has, then you can spot cases where the length on the reference is very different. In effect, you've lengthened (but made less precise) your ruler for measuring indels, and so you need many fewer measurements to find them.
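To make the "ruler" idea concrete, here is a minimal sketch in Python of flagging a single read pair whose separation on the reference is far from what the library predicts. The function name, the z-score cutoff and the library parameters are hypothetical choices for illustration, not taken from any particular tool.

```python
def classify_pair(mapped_distance, lib_mean, lib_sd, z_cutoff=3.0):
    """Label one read pair by how far its reference distance is from the library mean."""
    z = (mapped_distance - lib_mean) / lib_sd
    if z > z_cutoff:
        # Ends map farther apart on the reference than the fragment was long:
        # the sample is probably missing sequence present in the reference.
        return "possible deletion"
    if z < -z_cutoff:
        # Ends map closer together on the reference than the fragment was long:
        # the sample probably carries sequence absent from the reference.
        return "possible insertion"
    return "concordant"


# Hypothetical library: mean insert size 200 bp, standard deviation 10 bp.
LIB_MEAN, LIB_SD = 200.0, 10.0

for distance in (205, 260, 150):
    print(distance, classify_pair(distance, LIB_MEAN, LIB_SD))
```

A per-pair cutoff like this only catches events that are large relative to the spread of the library, which is why the width of the insert size distribution matters so much.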

One aside: in a recent Cancer Genomics webinar I watched a distinction was made between "mate pairs" and "paired ends" -- except now I forget which they assigned to which label (and am too lazy/time strapped to watch the webinar right now). In short, one is the case of sequencing both ends of a standardly prepared next-generation library, and the other involves snipping the middle out of a very large fragment to create the next-gen sequencing target. Here I was prepared to go pedantic and I'm caught napping!

Of course, that is if you know the distribution of DNA insert sizes. While you might have an estimate from the way the library is prepared, an obvious extension would be to infer the library's distribution from the actual data. An even more clever approach would be to use this distribution to pick out candidates in which the paired end sequences lie well within the distribution, but are consistently shifted relative to that distribution.

A paper fresh out of Nature Methods (subscription required & no abstract) incorporates precisely these ideas into a program called MoDIL. The program also explicitly models heterozygosity, allowing it to find heterozygous indels.

In performance analysis on actual human shotgun sequence, the MoDIL paper claims 95+% sensitivity for detecting indels of >=20bp. For the library used, this is detecting a 10% length difference (insert size mean: 208; stdev: 13). The supplementary materials also look at the ability to detect heterozygous deletions of various sizes as a function of genome coverage (the actual sequencing data used had 120X clone coverage, meaning the average nucleotide in the genome would be found in 120 DNA fragments in the sequencing run). Dropping the coverage by a factor of 3 would be expected to still pick up most indels of >=40bp.
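Why coverage matters here can be seen with a little arithmetic. The sketch below is not MoDIL's algorithm (which fits mixtures of distributions and models heterozygosity); it is just a plain z-test on the mean, assuming every pair spanning the site comes from a homozygous indel and that insert sizes are roughly normal. The library mean and standard deviation are the ones quoted above; the function names and the cutoff of 3 are illustrative assumptions.

```python
import math
from statistics import mean

LIB_MEAN = 208.0   # insert size mean quoted for the library
LIB_SD = 13.0      # insert size standard deviation quoted for the library


def single_pair_z(shift, lib_sd=LIB_SD):
    """How many standard deviations one pair moves for an indel of `shift` bp."""
    return shift / lib_sd


def cluster_z(mapped_distances, lib_mean=LIB_MEAN, lib_sd=LIB_SD):
    """z-score of the mean distance over all pairs spanning one site."""
    n = len(mapped_distances)
    return (mean(mapped_distances) - lib_mean) / (lib_sd / math.sqrt(n))


def min_pairs(shift, lib_sd=LIB_SD, z_cutoff=3.0):
    """Smallest number of spanning pairs whose mean shift reaches z_cutoff."""
    return math.ceil((z_cutoff * lib_sd / shift) ** 2)


print(f"one pair, 20 bp shift: z = {single_pair_z(20):.2f}")
print(f"16 pairs, 20 bp shift: z = {cluster_z([228] * 16):.2f}")
print(f"pairs needed for 20 bp: {min_pairs(20)}, for 40 bp: {min_pairs(40)}")
```

A single pair shifted by 20 bp sits comfortably inside the library distribution, but the average over many spanning pairs does not. A heterozygous indel shifts only about half the pairs, so it needs noticeably more coverage than this homozygous toy case suggests.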

Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods. DOI: 10.1038/nmeth.f.256


Monday, May 25, 2009

Pondering tumor suppressors

Now that I'm back in the cancer field full-time, I spend a lot of that time pondering the mysteries of the disease. Despite an explosion of knowledge about the disease during my lifetime, we truly don't understand how it works. In many ways we're still at the stage of the old story of seven blind men, not having figured out the elephant in front of us.

Sometimes when genes acquire mutations this moves a cell on the road to cancer. Such genes fall into two general categories. Oncogenes acquire activating mutations or are amplified and then play an active role in cancer. Tumor suppressors lead to disease when they are inactivated by mutations. A handful of genes have a very murky status, seemingly able to play both roles.

Many tumor suppressors were discovered through rare hereditary syndromes characterized by tumors. For example, RB1 is the retinoblastoma gene; inactivation of this gene in the retina leads to horrific tumors of the eye. NF1 is the neurofibromatosis gene; inactivation leads to benign tumors from nerves. Perhaps the best known in the popular space are BRCA1 and BRCA2, which greatly raise the risk of breast and ovarian cancer.

A great mystery for many such genes is why the tissue specificity of the tumor syndrome? In each of the genes mentioned above, the tumor syndrome appears to be very specific to a tissue type, yet in each of these cases the genes involved have been shown to be parts of cellular machinery used by every cell. Why does a failure of a general part manifest itself so specifically?

As we dig deeper into the genes and cancer, some of these distinctions do start smudging. BRCA1 mutations, for example, do also raise the risk of pancreatic cancer -- but not nearly to the same extent as for breast cancer. If we look not at known hereditary links to cancer but the genes mutated in any cancer, we see these same players showing up. For example, RB1 is frequently mutated in a variety of cancers, including lung cancers.

Here's an interesting further bit to ponder. BRCA1 and BRCA2 are in a pathway together, so it is not surprising that mutating either one would have a similar effect. But again, mutations in other members of the pathway lead to other genetic disorders with different spectra of cancers.

Now a new bit of the puzzle that continues the puzzling. One of the physical partners of BRCA1 is BARD1. A lot of effort has gone into finding variants in BARD1 and attempting to demonstrate their relevance to breast cancer risk. While many variants have been found in BARD1, the linkage to breast cancer is weak if it exists at all. But a new paper now links germline variation in BARD1 to the risk of aggressive neuroblastomas.

The one clear thread in this is that continuing to cross-reference these known tumor suppressors and their partners (such as this recent report on PALB2, a physical partner of BRCA2 with links now to breast and pancreatic cancer) with emerging genetic information will yield fruit. There are probably many more such associations to be found and perhaps additional proteins in these pathways to be uncovered. But when will we finally conceptualize the elephant? That remains to be seen.

Tuesday, May 19, 2009

Is Wolfram Alpha good for anything???

The much heralded web tool Wolfram Alpha debuted yesterday -- and I completely forgot about it. But today a coworker asked me about it & I kicked into full-blown test mode. Count me as underwhelmed.

Now, one of the things which it is supposed to excel at is collecting information or doing calculations. To be glib: it's not a search tool, but a find tool. I've thrown a bunch of queries at it, and have yet to find something really cool.

My first queries were complete duds. Asking for the fastest train time between New York and Chicago sometimes yielded a flight time from New York to Chicago, but usually elicits the "I don't understand you" message, though some wording I've lost gave me a time to a town in Europe called Train.

If you plug in a human gene name, the result is a sort of simplified Entrez gene name query. In some ways it is nice, but in others I found it less than fulfilling. Plug in KRAS and you get an overview of KRAS's genetic structure, but nothing about the fact that certain mutations in this gene are oncogenic. Don't put "gene" in the query and it guesses you mean some airport, though it does suggest the gene as an alternate option. Similarly, if you plug in EGFR, it's disappointing that it doesn't mention any of the important chemotherapeutics which target this.

Calculating things is supposed to be its forte, so I tried a bunch. The first few didn't work well (e.g. how many carbon atoms in human chromosome X), but I do now know where I can convert from millimeters to furlongs. So useful! Or even better, convert 60mph to angstroms per nanosecond -- how did I ever live without this?

One side complaint: Wolfram Alpha seems to be a nearly closed universe. Occasionally it will link out to Wikipedia on the side, but most of the facts it presents are dead ends. So if you think it's wrong, such as below, there's no obvious way to figure out how it figured out what it told you.

Similarly, it could stand to explain itself a bit more. I asked it to opine on the most important classification question in the world, and after several attempts "taxonomy of panda" (won't work with "pandas") I get the message "Assuming Ailuropoda melanoleuca | Use Ailurus fulgens instead" -- but nowhere does it give a common name or picture for either of these critters. Curiously, Wolfram Alpha puts "Ailurus fulgens" (the red panda) in with bears, where it definitely doesn't belong. I hadn't kept up with their taxonomy; according to both NCBI & Wikipedia they're now their own branch of carnivores and not in the Raccoon family.

The front page suggests typing in dates. Just putting in a day and month with no year was particularly useless, but other things I put in had curious results. September 11th, 2001 notes that the World Trade Center was destroyed, along with the death of one of the terrorists. December 7th, 1941 yields the attack on Pearl Harbor.

But can you believe that the only significant event it can remember for July 20th, 1969 is the birth of a minor TV actor Josh Holloway? That most glorious day in human technological achievement and it can only find some face-of-the-moment? AIIGGGHH!!!!!!!!!!!!!

Monday, May 11, 2009

Gene Tests Don't Blow Up!

Today's Globe has a profile of the do-it-yourself genetic testing experiment that my former colleague Kay Aull is performing. Among the people quoted is yours truly.

Okay, it's really cool. I did once get a mention with several sentences in Newsweek (with a very distressed Mickey Mouse on the cover) but this time I got several column inches. However, after I gave the phone interview I came down with a small case of the worries. What if I was misquoted? Worse, what if I was correctly quoted but pulled a Watson? Luckily, what made it in fails to induce embarrassment, though there are bits which I wish hadn't been left out.

The article is well worth reading (though it may become a pay article overnight; I forget the current policy). With luck the wire services & aggregators will pick up on it.

I think anyone interested in genetic testing, DIY-bio, or just science in general should skim the comments thread. There's a lot there to be worried about.

First, a running theme is a worry that Kay will blow up her block or such. Multiple posters raise this, many claiming to work in labs. Now, as Kay's comment (which is nice and level-headed, as I would have expected) points out, she's not using anything liable to do anything like that. For the level of ethanol precipitation she's doing, a fifth of vodka would last quite a long time (an interesting experiment; I remember the Russians are said to have built lasers with the stuff).

A second class of fear is other sorts of toxins, primarily the spectre of ethidium bromide (a known carcinogen) as a DNA stain. There are other, much safer stains, and it turns out that's what Kay is using.

Another general negative sentiment is that perhaps the city or her landlord should (or might) shut this down. I'm no lawyer, but this certainly wasn't obviously prohibited by any of my lease agreements. Putting household cleaners in the public's hands (or solvents in the form of nail polish or paint removers) scares me far more than a little PCR.

One more sentiment worth noting: that this sort of thing should be done only in an official laboratory and that Kay shouldn't do this without getting a masters or Ph.D. first. I suspect that these posters aren't aware that many of the same techniques are available in the toy section of any Target or Wal-Mart. True, none of those offer PCR -- but they easily could. PCR can be run without any special gear, though it would be awfully tedious. They are probably also unaware of modern scientists who worked without Ph.D.s (e.g. Nobelist Gertrude Elion) or in home labs (e.g. Nobelist Rita Levi-Montalcini).

On the other end of things, some of the positive posters are a bit worrisome. One makes the quite apropos comparison of this to having a home darkroom, but gets their chemicals confused -- while the stop solution is indeed just acetic acid, the fixer is not "drinkable but dull" but rather cyanide-based (cyanide is a great remover of silver, which is the job of the fixer).

There are also a number of posters who suggest that this information might be used against her by an insurance company or that it would be illegal to withhold it from same. Whether this would be prohibited by GINA isn't considered; I'm guessing the posters aren't familiar with it. Another poster relishes the idea that
Perhaps she objects to the greed of her peers at Harvard who are charging people for the opportunity to get similar bio data - See http://www.genomeweb.com/blog/round-100.
-- which is bizarre, given that the very GenomeWeb article cited mentions that these tests are free to participants!

Regardless of how poorly informed or quick to leap to conclusions some of these folks are, this is indeed the landscape of public opinion, at least as plumbed by response to this article. It would suggest that there is a lot of educating to do & that it will be an uphill battle. To a lot of people, science means formal labs and formal training and labs mean dangerous chemicals that might explode.