Sunday, May 31, 2009

Teasing small insertion/deletion events from next-gen data

My interest in next-generation sequencing is well on the way to shifting from hobby to work-central, which is exciting. So I'm now really paying attention to the literature on the subject.

One of the interesting uses for next-generation sequencing is identifying insertion or deletion alleles (indels) in genomes, particularly the human genome. Of course, the best way to do this is to do a lot of sequencing, compare the sequence reads against a reference genome, and identify specific insertions or deletions in the reads. However, this is generally going to require a full genome run & a certain amount of luck, especially in a diploid organism as you might not sample both alleles enough to see a heterozygous indel. A cancer genome might be even worse: these often have many more than two copies of the DNA at a given position and potentially there could be more than two different versions. In any case, full genome runs are in the ballpark of $50K, so if you really want to look at a lot of genomes a more efficient strategy is needed.

The most common approach is to sequence both ends of a DNA molecule and then compare the predicted distance between those ends with the distance on the reference genome. If you know the distribution of lengths that the sequence library has, then you can spot cases where the length on the reference is very different. In effect, you've lengthened (but made less precise) your ruler for measuring indels, and so you need many fewer measurements to find them.
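To make the ruler idea concrete, here is a minimal Python sketch, not taken from any published tool. It assumes you already have the mapped distance on the reference for each read pair and that the library's insert sizes are roughly normal; the function name `flag_discordant` and the 3-standard-deviation cutoff are my own illustrative choices.

```python
def flag_discordant(mapped_spans, lib_mean, lib_sd, z_cutoff=3.0):
    """Flag read pairs whose span on the reference is an outlier.

    Returns a list of (span, implied_size, z) tuples. A positive
    implied_size means the pair maps farther apart on the reference
    than the library predicts, which suggests the sample is missing
    DNA there (a deletion); a negative value suggests an insertion.
    """
    flagged = []
    for span in mapped_spans:
        z = (span - lib_mean) / lib_sd  # distance from the library mean, in sd units
        if abs(z) >= z_cutoff:
            flagged.append((span, span - lib_mean, z))
    return flagged


# Hypothetical spans, using a library with mean 208 and stdev 13
calls = flag_discordant([208, 215, 300, 120], lib_mean=208, lib_sd=13)
for span, size, z in calls:
    kind = "deletion" if size > 0 else "insertion"
    print(f"span={span} implied {kind} of ~{abs(size)}bp (z={z:.1f})")
```

With these made-up numbers only the 300 and 120 spans are flagged; a pair-by-pair cutoff like this can only see indels that are large relative to the spread of the library.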

One aside: in a recent Cancer Genomics webinar I watched a distinction was made between "mate pairs" and "paired ends" -- except now I forget which they assigned to which label (and am too lazy/time strapped to watch the webinar right now). In short, one is the case of sequencing both ends of a standardly prepared next-generation library, and the other involves snipping the middle out of a very large fragment to create the next-gen sequencing target. Here I was prepared to go pedantic and I'm caught napping!

Of course, that is if you know the distribution of DNA insert sizes. While you might have an estimate from the way the library is prepared, an obvious extension would be to infer the library's distribution from the actual data. An even more clever approach would be to use this distribution to pick out candidates in which the paired end sequences lie well within the distribution, but are consistently shifted relative to that distribution.
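Both ideas can be sketched in a few lines of Python. This is a deliberately crude illustration, not the method of any particular program: it estimates the library's center and spread from the data with a median and a scaled median absolute deviation (my choice, so that real indels don't skew the estimate), then runs a plain z-test on the mean span of the pairs covering one locus. The function names and the normality assumption are mine.

```python
import math
import statistics


def estimate_library(spans):
    """Robustly estimate the insert-size center and spread from observed spans."""
    center = statistics.median(spans)
    mad = statistics.median(abs(s - center) for s in spans)
    # 1.4826 scales the MAD to a standard deviation for normal data
    return center, 1.4826 * mad


def locus_shift_z(locus_spans, lib_mean, lib_sd):
    """Return (mean shift, z-score) for the pairs spanning one locus.

    Each pair may sit well inside the library distribution, but the
    mean of n pairs has a standard error of only lib_sd / sqrt(n).
    """
    n = len(locus_spans)
    shift = statistics.fmean(locus_spans) - lib_mean
    return shift, shift / (lib_sd / math.sqrt(n))


# Hypothetical locus: 16 pairs, each only 20bp longer than expected
shift, z = locus_shift_z([228] * 16, lib_mean=208, lib_sd=13)
print(f"shift={shift:.0f}bp, z={z:.2f}")
```

In this toy case every pair is only about 1.5 standard deviations out, so none would be flagged on its own, yet the 16 together give a z-score a little above 6.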

A paper fresh out of Nature Methods (subscription required & no abstract) incorporates precisely these ideas into a program called MoDIL. The program also explicitly models heterozygosity, allowing it to find heterozygous indels.

In performance analysis on actual human shotgun sequence, the MoDIL paper claims 95+% sensitivity for detecting indels of >=20bp. For the library used, this means detecting a roughly 10% length difference (insert size mean: 208; stdev: 13). The supplementary materials also look at the ability to detect heterozygous deletions of various sizes as a function of genome coverage (the actual sequencing data used had 120X clone coverage, meaning the average nucleotide in the genome would be found in 120 DNA fragments in the sequencing run). Dropping the coverage by a factor of 3 would be expected to still pick up most indels of >=40bp.
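A quick back-of-the-envelope check on those figures, as a Python sketch. This is my own arithmetic, not an analysis from the paper, and it assumes the insert sizes are roughly normal. The last field uses the textbook fact that an equal-weight mixture of two normals with the same spread only shows two peaks when the means are more than two standard deviations apart, which is a rough proxy for how hard a heterozygous indel is to see.

```python
LIB_MEAN = 208.0  # insert size mean reported for the library
LIB_SD = 13.0     # insert size stdev reported for the library


def indel_stats(indel_bp, mean=LIB_MEAN, sd=LIB_SD):
    """Summarize how big an indel is relative to the library."""
    return {
        # indel length as a fraction of the mean insert size
        "fraction_of_insert": indel_bp / mean,
        # shift of a single read pair, in standard deviations
        "shift_in_sd": indel_bp / sd,
        # would a 50/50 mix of shifted and unshifted pairs show two peaks?
        "het_mixture_bimodal": indel_bp > 2 * sd,
    }


for size in (20, 40):
    print(size, indel_stats(size))
```

A 20bp indel is about 9.6% of the insert and only about 1.5 standard deviations, so a heterozygous one hides inside a single-peaked distribution and takes a lot of clones to tease out; a 40bp indel is about 3 standard deviations, which fits with it surviving a drop in coverage.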

Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods. DOI: 10.1038/nmeth.f.256

Monday, May 25, 2009

Pondering tumor suppressors

Now that I'm back in the cancer field full-time, I spend a lot of that time pondering the mysteries of the disease. Despite an explosion of knowledge about the disease during my lifetime, we truly don't understand how it works. In many ways we're still at the stage of the old story of seven blind men, not having figured out the elephant in front of us.

Sometimes when genes acquire mutations this moves a cell on the road to cancer. Such genes fall into two general categories. Oncogenes acquire activating mutations or are amplified and then play an active role in cancer. Tumor suppressors lead to disease when they are inactivated by mutations. A handful of genes have a very murky status, seemingly able to play both roles.

Many tumor suppressors were discovered through rare hereditary syndromes characterized by tumors. For example, RB1 is the retinoblastoma gene; inactivation of this gene in the retina leads to horrific tumors of the eye. NF1 is the neurofibromatosis gene; inactivation leads to benign tumors from nerves. Perhaps the best known in the popular space are BRCA1 and BRCA2, which greatly raise the risk of breast and ovarian cancer.

A great mystery for many such genes is the tissue specificity of the tumor syndrome. In each of the genes mentioned above, the tumor syndrome appears to be very specific to a tissue type, yet in each of these cases the genes involved have been shown to be parts of cellular machinery used by every cell. Why does a failure of a general part manifest itself so specifically?

As we dig deeper into the genes and cancer, some of these distinctions do start smudging. BRCA1 mutations, for example, do also raise the risk of pancreatic cancer -- but not nearly to the same extent as for breast cancer. If we look not at known hereditary links to cancer but at the genes mutated in any cancer, we see these same players showing up. For example, RB1 is frequently mutated in a variety of cancers, including lung cancers.

Here's an interesting further bit to ponder. BRCA1 and BRCA2 are in a pathway together, so it is not surprising that mutating either one would have a similar effect. But again, mutations in other members of the pathway lead to other genetic disorders with different spectra of cancers.

Now a new bit of the puzzle that continues the puzzling. One of the physical partners of BRCA1 is BARD1. A lot of effort has gone into finding variants in BARD1 and attempting to demonstrate their relevance to breast cancer risk. While many variants have been found in BARD1, the linkage to breast cancer is weak if it exists at all. But a new paper now links germline variation in BARD1 to the risk of aggressive neuroblastomas.

The one clear thread in this is that continuing to cross-reference these known tumor suppressors and their partners (such as this recent report on PALB2, a physical partner of BRCA2 with links now to breast and pancreatic cancer) with emerging genetic information will yield fruit. There are probably many more such associations to be found and perhaps additional proteins in these pathways to be uncovered. But when will we finally conceptualize the elephant? That remains to be seen.

Tuesday, May 19, 2009

Is Wolfram Alpha good for anything???

The much heralded web tool Wolfram Alpha debuted yesterday -- and I completely forgot about it. But today a coworker asked me about it & I kicked into full-blown test mode. Count me as underwhelmed.

Now, one of the things which it is supposed to excel at is collecting information or doing calculations. To be glib: it's not a search tool, but a find tool. I've thrown a bunch of queries at it, and have yet to find something really cool.

My first queries were complete duds. Asking for the fastest train time between New York and Chicago yielded a flight time from New York to Chicago or, more usually, the "I don't understand you" message, though some wording I've lost gave me a time to a town in Europe called Train.

If you plug in a human gene name, the result is a sort of simplified Entrez gene name query. In some ways it is nice, but in others I found it less than fulfilling. Plug in KRAS and you get an overview of KRAS's genetic structure, but nothing about the fact that certain mutations in this gene are oncogenic. Don't put "gene" in the query and it guesses you mean some airport, though it does suggest the gene as an alternate option. Similarly, if you plug in EGFR, it's disappointing that it doesn't mention any of the important chemotherapeutics which target this.

Calculating things is supposed to be its forte, so I tried a bunch. The first few didn't work well (e.g. how many carbon atoms in human chromosome X), but I do now know where I can convert from millimeters to furlongs. So useful! Or even better, convert 60mph to angstroms per nanosecond -- how did I ever live without this?
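For the record, the arithmetic behind those two conversions fits in a few lines of Python. This is just my own sketch using the standard definitions (1 mile = 1609.344 m, 1 furlong = 201.168 m, 1 m = 10^10 angstroms, 1 s = 10^9 ns); the function names are made up and it says nothing about how Wolfram Alpha does it.

```python
MILE_M = 1609.344      # meters per international mile
FURLONG_M = 201.168    # meters per furlong (one eighth of a mile)
ANGSTROM_PER_M = 1e10
NS_PER_S = 1e9


def mph_to_angstrom_per_ns(mph):
    """Convert miles per hour to angstroms per nanosecond."""
    m_per_s = mph * MILE_M / 3600.0
    return m_per_s * ANGSTROM_PER_M / NS_PER_S


def mm_to_furlongs(mm):
    """Convert millimeters to furlongs."""
    return (mm / 1000.0) / FURLONG_M


print(mph_to_angstrom_per_ns(60))
print(mm_to_furlongs(201168))
```

So 60mph works out to about 268 angstroms per nanosecond, and it takes 201,168 millimeters to make a furlong.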

One side complaint: Wolfram Alpha seems to be a nearly closed universe. Occasionally it will link out to Wikipedia on the side, but most of the facts it presents are dead ends. So if you think it's wrong, such as below, there's no obvious way to figure out how it figured out what it told you.

Similarly, it could stand to explain itself a bit more. I asked it to opine on the most important classification question in the world, and after several attempts "taxonomy of panda" (it won't work with "pandas") got me the message "Assuming Ailuropoda melanoleuca | Use Ailurus fulgens instead" -- but nowhere does it give a common name or picture for either of these critters. Curiously, Wolfram Alpha puts "Ailurus fulgens" (the red panda) in with bears, where it definitely doesn't belong. I hadn't kept up with their taxonomy; according to both NCBI & Wikipedia they're now their own branch of carnivores and not in the raccoon family.

The front page suggests typing in dates. Just putting in a day and month with no year was particularly useless, but other things I put in had curious results. September 11th, 2001 notes that the World Trade Center was destroyed, along with the death of one of the terrorists. December 7th, 1941 yields the attack on Pearl Harbor.

But can you believe that the only significant event it can remember for July 20th, 1969 is the birth of a minor TV actor Josh Holloway? That most glorious day in human technological achievement and it can only find some face-of-the-moment? AIIGGGHH!!!!!!!!!!!!!

Monday, May 11, 2009

Gene Tests Don't Blow Up!

Today's Globe has a profile of the do-it-yourself genetic testing experiment that my former colleague Kay Aull is performing. Among the people quoted is yours truly.

Okay, it's really cool. I did once get a mention with several sentences in Newsweek (with a very distressed Mickey Mouse on the cover) but this time I got several column inches. However, after I gave the phone interview I came down with a small case of the worries. What if I was misquoted? Worse, what if I was correctly quoted but pulled a Watson? Luckily, what made it in fails to induce embarrassment, though there are bits which I wish hadn't been left out.

The article is well worth reading (though it may become a pay article overnight; I forget the current policy). With luck the wire services & aggregators will pick up on it.

I think anyone interested in genetic testing, DIY-bio, or just science in general should skim the comments thread. There's a lot there to be worried about.

First, a running theme among multiple posters, many claiming to work in labs, is a worry that Kay will blow up her block or such. Now, as Kay's comment (which is nice and level-headed, as I would have expected) points out, she's not using anything liable to do anything like that. For the level of ethanol precipitation she's doing, a fifth of vodka would last quite a long time (an interesting experiment; I remember the Russians are said to have built lasers with the stuff).

A second class of fear is other sorts of toxins, primarily the spectre of ethidium bromide (a known carcinogen) as a DNA stain. There are other, much safer stains, and it turns out that's what Kay is using.

Another general negative sentiment is that perhaps the city or her landlord should (or might) shut this down. I'm no lawyer, but this certainly wasn't obviously prohibited by any of my lease agreements. Putting household cleaners in the public's hands (or solvents in the form of nail polish or paint removers) scares me far more than a little PCR.

One more sentiment worth noting: that this sort of thing should be done only in an official laboratory and that Kay shouldn't do this without getting a masters or Ph.D. first. I suspect that these posters aren't aware that many of the same techniques are available in the toy section of any Target or Wal-Mart. True, none of those offer PCR -- but they easily could. PCR can be run without any special gear, though it would be awfully tedious. They are probably also unaware of modern scientists who worked without Ph.D.s (e.g. Nobelist Gertrude Elion) or in home labs (e.g. Nobelist Rita Levi-Montalcini).

On the other end of things, some of the positive posters are a bit worrisome. One makes the quite apropos comparison of this to having a home darkroom, but gets their chemicals confused -- while the stop solution is indeed just acetic acid, the fixer is not "drinkable but dull" but rather cyanide-based (cyanide is a great remover of silver, which is the job of the fixer).

There are also a number of posters who suggest that this information might be used against her by an insurance company or that it would be illegal to withhold it from same. Whether this would be prohibited by GINA isn't considered; I'm guessing the posters aren't familiar with it. Another poster relishes the idea that
Perhaps she objects to the greed of her peers at Harvard who are charging people for the opportunity to get similar bio data - See
-- which is bizarre, given that the very GenomeWeb article mentions that these tests are free to participants!

Regardless of how poorly informed or quick to leap to conclusions some of these folks are, this is indeed the landscape of public opinion, at least as plumbed by response to this article. It would suggest that there is a lot of educating to do & that it will be an uphill battle. To a lot of people, science means formal labs and formal training and labs mean dangerous chemicals that might explode.

Sunday, May 03, 2009

The New Gig

I've always been a fan of the space program and I like movies, so when a movie astronaut speaks I listen. Since Beyond Genomics changed its name to BG Medicine, I can only interpret the advice as directing me to Infinity Pharmaceuticals.

Seriously, tomorrow I start at Infinity. Infinity has a number of anti-cancer programs which I'm excited to be joining. Of course, having drugs in the clinic can be a rocky ride; the day I agreed to go was the day a clinical trial was halted, and Infinity's stock fell 30% (or does somebody on Wall Street just not like me?).

Strange but true story: The day of my interview, a new Netflix disc was scheduled to arrive. The title: Infinity. Spooky!

As far as this space goes, there will probably be some subtle shifts. I'm probably a little too careful about not posting directly around where I'm working, but that is my habit and so areas such as cancer genomics may see less action. Infinity, as mentioned above, is public & so one must follow certain rules.

On the other hand, that still leaves a lot of biology to comment on. I probably will mine more of synthetic biology, a lot of genomics/proteomics/younameitomics and evolution. Computational stuff I'm working on -- plus some old interests that were lit anew during my time out. Plus some of my learnings from that time, where I set up and then dismantled a trans-Pacific consulting empire (yep! often had to cross Pacific Street to go from one client to another).