Monday, June 29, 2009

Lox: The last genome for electrophoretic Sanger?

Amongst the news last week is a bit of a surprise: the salmon genome project is choosing Sanger sequencing for the first phase of the project. Alas, one needs a premium subscription to In Sequence, which I lack, so I can't read the full article. But, the group has published (open access) a pilot study on some BACs, which concluded that 454 sequencing couldn't resolve a bunch of the sequence, and so shorter read technologies are presumably ruled out as well. A goal of the project is a high quality reference sequence to serve as a benchmark for related fish, demanding very high quality.

This announcement is a jolt for anyone who has concluded that Sanger has been largely put out to pasture, confined to niches such as verifying clones and low-throughput projects. Despite the gaudy throughput of the next-gen sequencers, read length remains a problem. That hasn't stopped de novo assembly projects such as panda from proceeding forward, but apparently salmon is even nastier when it comes to repeats.

Still playing the armchair next-gen sequencer (for the moment!), I find it an interesting gedanken experiment. Suppose you had a rough genome you really, really wanted to sequence and get a high-quality reference sequence. On the one hand, Sanger sequencing is very well proven. However, it is also more expensive per base than the newer technologies. Furthermore, Sanger is pretty much a mature technology, with little investment in further improvement. This is in contrast to next gen platforms, which are being pushed harder and harder both by the manufacturers as well as the more adventurous users. This includes novel sequencing protocols to address difficult DNA, such as the recently published Long March technique (which I'm still wrapping my head around) that generates nested libraries for next-gen sequencing using a serial Type IIS digestion scheme. Complete Genomics has some trick for inserting multiple priming sites per circular DNA template. Plus, Pacific Biosciences has demonstrated really long reads in a next gen platform -- but demonstrating is different than having it in production.

So it boils down to the key question: do you spend your resources on the tried-and-true, but potentially pricey approach, or bet that emerging techniques and technologies can deliver the goods soon enough? Put another way, how critical is a high quality reference sequence? Perhaps it would be better to generate very piecemeal drafts of multiple species now and then go for finishing the genomes when the new technologies come on line. But what experiments dependent on that high quality reference would be put off a few years? And what if the new technologies don't deliver? Then you must fall back on Sanger and end up quite a bit behind schedule.

It's not an easy call. Will salmon be the last Sanger genome? It all depends on whether the new approaches and platforms can really deliver -- and someone is daring enough to try them on a really challenging genome.

Sunday, June 21, 2009

Cancer Genome Sequencing--A (Pessimistic) Interim Analysis

The current issue of Cancer Research carries a very brief (3 pages, with one page mostly tables & figures) review of the first pulse of cancer genome sequencing papers (sub required to read article). While sub-titled 'An Interim Analysis', perhaps a better subtitle would be 'A Uniformly Negative Analysis'.

A full-press cancer genomics project has been a controversial drive, with many bemoaning the huge amount of resources devoted to it and believing other avenues would be better suited for enhancing our ability to help cancer patients. But it has gone forward, and a spate of papers over the last year have reported the early results.

The initial papers have covered 4 of the big cancers in terms of incidence and mortality (lung, breast, colorectal and pancreatic) as well as glioblastoma and leukemia. Different studies have taken different tacks. In leukemia, we have the first parallel complete sequencing of a patient and their tumor. Papers in breast, colorectal (together covered in two papers here and here), pancreatic and glioblastoma looked at huge numbers of coding exons in small numbers of patients (11 patients x 18.2K genes for breast and colorectal; 21 patients x 20.6K genes for glioblastoma; 24 patients x 20.6K genes for pancreatic). A lung paper and the other glioblastoma paper looked at ~600 genes, but in larger numbers of patients (188 in lung and 91 in glioblastoma).

Personally, I would take a more nuanced view of the results. I think it is hard to argue that these papers have had a shortage of fireworks: there have been some important observations made, which curiously the Cancer Research review ignores completely. In the lung study (which I have studied the closest) these include important exclusion and cooperativity relationships between mutations and a number of novel, druggable candidate driver genes (protein kinases) not previously suspected in lung cancer. In the many-genes-few-patients glioblastoma study, it was the identification of a mutational hotspot in isocitrate dehydrogenase 1 (later found to be present, though less frequently mutated, in isocitrate dehydrogenase 2).

Of course, one thing which is changing rapidly is the cost of doing these studies. Most of these papers used conventional PCR amplification and Sanger sequencing, which I would lowball estimate at $1/well (very lowball -- Sandra Porter caught some serious flak for suggesting [as I have] a number much higher than this for the sequencing part, and I don't have the accounting experience to argue, but I do know people who calculated it at Codon and this would be a very low estimate). So those studies looking at nearly every coding exon ran at least a quarter million dollars per patient (those 20+K genes explode out to about a quarter million exons). Clearly this isn't how things will tend to be done going forward; Illumina will now blow away genomes for $48K each and other companies are now quoting even lower. That is still well in excess of the per-patient cost of the very focused studies, and I believe these (particularly the lung study) demonstrate the value of lots of patients, since this started to give the numbers required to look at interactions between mutations.
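To make that arithmetic concrete, here's a back-of-the-envelope sketch in Python; the per-well cost, exon count and genome price are just the rough figures quoted above, not anyone's actual accounting.

```python
# Back-of-the-envelope comparison using the lowball figures from the text
COST_PER_WELL = 1.00          # very lowball $ per PCR + Sanger well
EXONS_PER_PATIENT = 250_000   # ~20K genes explode out to roughly a quarter million exons
ILLUMINA_GENOME = 48_000      # quoted whole-genome price, $

sanger_per_patient = COST_PER_WELL * EXONS_PER_PATIENT
print(f"Exon-by-exon Sanger, per patient: ${sanger_per_patient:,.0f}")
print(f"Whole genome on Illumina, per patient: ${ILLUMINA_GENOME:,}")
print(f"Ratio: {sanger_per_patient / ILLUMINA_GENOME:.1f}x")
```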

One of the reasons the Cancer Research authors aren't terribly pleased with the progress is clear: they feel the experiments aren't the correct ones. But whereas some of the flak I had seen directed at the cancer genome sequencing concept was instead promoting more functional approaches (such as RNAi library screening), what these authors want (or at least set as the minimum bar for interesting) is cancer genome screening on an almost monomaniacal scale: thousands if not millions of individual cells from the same tumor! Clearly this would be fascinating, as there is plenty of evidence that tumors are a motley collection of genetically variant cells (but clonal -- all the tumor cells have the same ancestor, but they also are all sloppy DNA copyists). And, as they note, no DNA sequencing technology here now or on the immediate horizon has any shot at a project of this scale.

While I do believe this would be interesting, I'm not as certain it would be informative for patient care. Since many of these mutations are under very little selection, the spectrum of observed mutations is likely to be enormous. And there is already a horrendous backlog of characterizing the mutations seen in the studies to date (though there has been a paper already functionally characterizing the isocitrate dehydrogenase mutations), so it is hard to see how a vastly larger catalog would be put to use any time soon.

What is particularly strange about this view is that a more reasonable intermediate step would be to look at those cells that do escape the primary tumor (most of the cancer genome papers so far have focused on primary tumors, though the IDH mutations are primarily found in secondary glioblastomas) -- sequence the metastases. Ideally, this would mean finding multiple patients willing to consent to their genome, their primary's genome, and multiple metastases' genomes being sequenced -- the latter quite likely coming from autopsies (otherwise it is a lot of painful biopsying without much hope of helping the patient, an ethically questionable activity). Or, in leukemias one could more easily resequence after each relapse. Such studies would be doable technically and not cost ridiculous (though clearly not chump change either).

There's also the open question as to whether the real fireworks will come from sequencing less studied cancers, such as the recent success in using transcriptome sequencing to identify the probable causative mutation in a rare type of ovarian cancer (see also the News and Views piece). Perhaps we've mined the rich ore out of some of these veins, and it is the less worked seams which will yield fine genomic insights. Time will tell.

Sunday, May 31, 2009

Teasing small insertion/deletion events from next-gen data

My interest in next-generation sequencing is well on the way to shifting from hobby to work-central, which is exciting. So I'm now really paying attention to the literature on the subject.

One of the interesting uses for next-generation sequencing is identifying insertion or deletion alleles (indels) in genomes, particularly the human genome. Of course, the best way to do this is to do a lot of sequencing, compare the sequence reads against a reference genome, and identify specific insertions or deletions in the reads. However, this is generally going to require a full genome run & a certain amount of luck, especially in a diploid organism as you might not sample both alleles enough to see a heterozygous indel. A cancer genome might be even worse: these often have many more than two copies of the DNA at a given position and potentially there could be more than two different versions. In any case, full genome runs are in the ballpark of $50K, so if you really want to look at a lot of genomes a more efficient strategy is needed.

The most common approach is to sequence both ends of a DNA molecule and then compare the predicted distance between those ends with the distance on the reference genome. If you know the distribution of lengths that the sequence library has, then you can spot cases where the length on the reference is very different. In effect, you've lengthened (but made less precise) your ruler for measuring indels, and so you need many fewer measurements to find them.

One aside: in a recent Cancer Genomics webinar I watched a distinction was made between "mate pairs" and "paired ends" -- except now I forget which they assigned to which label (and am too lazy/time strapped to watch the webinar right now). In short, one is the case of sequencing both ends of a standardly prepared next-generation library, and the other involves snipping the middle out of a very large fragment to create the next-gen sequencing target. Here I was prepared to go pedantic and I'm caught napping!

Of course, that is if you know the distribution of DNA insert sizes. While you might have an estimate from the way the library is prepared, an obvious extension would be to infer the library's distribution from the actual data. An even more clever approach would be to use this distribution to pick out candidates in which the paired end sequences lie well within the distribution, but are consistently shifted relative to that distribution.
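As a toy illustration of that idea (emphatically not the actual MoDIL algorithm, which fits mixtures of distributions and explicitly models heterozygosity), here's a minimal Python sketch: estimate the library's insert-size distribution from all mapped pairs, then flag a locus whose reference-implied spans are consistently shifted. All the numbers are invented, loosely echoing the ~208 bp library discussed below.

```python
import statistics

def estimate_library(all_spans):
    """Infer the library insert-size distribution from every mapped pair."""
    return statistics.mean(all_spans), statistics.stdev(all_spans)

def call_indel(locus_spans, mu, sigma, z_cutoff=3.0):
    """Flag a locus whose spans are consistently shifted relative to the library.

    Spans larger than the library mean suggest a deletion in the sample (reads map
    farther apart on the reference than the true fragment length); smaller spans
    suggest an insertion.
    """
    shift = statistics.mean(locus_spans) - mu
    z = shift / (sigma / len(locus_spans) ** 0.5)  # standard error shrinks with more pairs
    if abs(z) > z_cutoff:
        return ("deletion" if shift > 0 else "insertion", round(shift, 1))
    return None

# Invented library spans, roughly 208 +/- 13 bp
mu, sigma = estimate_library([195, 202, 208, 211, 215, 219, 206, 213, 199, 212])
# Ten pairs spanning one locus, each implying ~25 bp more reference than fragment
print(call_indel([233, 230, 236, 228, 231, 235, 229, 234, 232, 230], mu, sigma))
```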

A paper fresh out of Nature Methods (subscription required & no abstract) incorporates precisely these ideas into a program called MoDIL. The program also explicitly models heterozygosity, allowing it to find heterozygous indels.

In performance analysis on actual human shotgun sequence, the MoDIL paper claims 95+% sensitivity for detecting indels of >=20bp. For the library used, this corresponds to detecting a ~10% length difference (insert size mean: 208; stdev: 13). The supplementary materials also look at the ability to detect heterozygous deletions of various sizes as a function of genome coverage (the actual sequencing data used had 120X clone coverage, meaning the average nucleotide in the genome would be found in 120 DNA fragments in the sequencing run). Dropping the coverage by a factor of 3 would be expected to still pick up most indels of >=40bp.

Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods. DOI: 10.1038/nmeth.f.256


Monday, May 25, 2009

Pondering tumor suppressors

Now that I'm back in the cancer field full-time, I spend a lot of that time pondering the mysteries of the disease. Despite an explosion of knowledge about the disease during my lifetime, we truly don't understand how it works. In many ways we're still at the stage of the old story of seven blind men, not having figured out the elephant in front of us.

Sometimes when genes acquire mutations this moves a cell on the road to cancer. Such genes fall into two general categories. Oncogenes acquire activating mutations or are amplified and then play an active role in cancer. Tumor suppressors lead to disease when they are inactivated by mutations. A handful of genes have a very murky status, seemingly able to play both roles.

Many tumor suppressors were discovered through rare hereditary syndromes characterized by tumors. For example, RB1 is the retinoblastoma gene; inactivation of this gene in the retina leads to horrific tumors of the eye. NF1 is the neurofibromatosis gene; inactivation leads to benign tumors arising from nerves. Perhaps the best known in the popular space are BRCA1 and BRCA2, which greatly raise the risk of breast and ovarian cancer.

A great mystery for many such genes is the tissue specificity of the tumor syndrome. In each of the genes mentioned above, the tumor syndrome appears to be very specific to a tissue type, yet in each of these cases the genes involved have been shown to be parts of cellular machinery used by every cell. Why does a failure of a general part manifest itself so specifically?

As we dig deeper into the genes and cancer, some of these distinctions do start smudging. BRCA1 mutations, for example, do also raise the risk of pancreatic cancer -- but not nearly to the extent as for breast cancer. If we look not at known hereditary links to cancer but the genes mutated in any cancer, we see these same players showing up. For example, RB1 is frequently mutated in a variety of cancers, including lung cancers.

Here's an interesting further bit to ponder. BRCA1 and BRCA2 are in a pathway together, so it is not surprising that mutating either one would have a similar effect. But again, mutations in other members of the pathway lead to other genetic disorders with different spectra of cancers.

Now a new bit of the puzzle that continues the puzzling. One of the physical partners of BRCA1 is BARD1. A lot of effort has gone into finding variants in BARD1 and attempting to demonstrate their relevance to breast cancer risk. While many variants have been found in BARD1, the linkage to breast cancer is weak if it exists at all. But a new paper now links germline variation in BARD1 to the risk of aggressive neuroblastomas.

The one clear thread in this is that continuing to cross-reference these known tumor suppressors and their partners (such as this recent report on PALB2, a physical partner of BRCA2 with links now to breast and pancreatic cancer) with emerging genetic information will yield fruit. There are probably many more such associations to be found and perhaps additional proteins in these pathways to be uncovered. But when will we finally conceptualize the elephant? That remains to be seen.

Tuesday, May 19, 2009

Is Wolfram Alpha good for anything???

The much heralded web tool Wolfram Alpha debuted yesterday -- and I completely forgot about it. But today a coworker asked me about it & I kicked into full-blown test mode. Count me as underwhelmed.

Now, one of the things which it is supposed to excel at is collecting information or doing calculations. To be glib: it's not a search tool, but a find tool. I've thrown a bunch of queries at it, and have yet to find something really cool.

My first queries were complete duds. Asking for the fastest train time between New York and Chicago yielded a flight time from New York to Chicago; further rephrasing usually elicits the "I don't understand you" message, though some wording I've since lost gave me a time to a town in Europe called Train.

If you plug in a human gene name, the result is a sort of simplified Entrez gene name query. In some ways it is nice, but in others I found it less than fulfilling. Plug in KRAS and you get an overview of KRAS's genetic structure, but nothing about the fact that certain mutations in this gene are oncogenic. Don't put "gene" in the query and it guesses you mean some airport, though it does suggest the gene as an alternate option. Similarly, if you plug in EGFR, it's disappointing that it doesn't mention any of the important chemotherapeutics which target this.

Calculating things is supposed to be its forte, so I tried a bunch. The first few didn't work well (e.g. how many carbon atoms in human chromosome X), but I do now know where I can convert from millimeters to furlongs. So useful! Or even better, convert 60mph to angstroms per nanosecond -- how did I ever live without this?

One side complaint: Wolfram Alpha seems to be a nearly closed universe. Occasionally it will link out to Wikipedia on the side, but most of the facts it presents are dead ends. So if you think it's wrong, such as below, there's no obvious way to figure out how it figured out what it told you.

Similarly, it could stand to explain itself a bit more. I asked it to opine on the most important classification question in the world, and after several attempts at "taxonomy of panda" (won't work with "pandas") I get the message "Assuming Ailuropoda melanoleuca | Use Ailurus fulgens instead" -- but nowhere does it give a common name or picture for either of these critters. Curiously, Wolfram Alpha puts "Ailurus fulgens" (the red panda) in with bears, where it definitely doesn't belong. I hadn't kept up with their taxonomy; according to both NCBI & Wikipedia they're now their own branch of carnivores and not in the raccoon family.

The front page suggests typing in dates. Just putting in a day and month with no year was particularly useless, but other things I put in had curious results. September 11th, 2001 notes that the World Trade Center was destroyed, along with the death of one of the terrorists. December 7th, 1941 yields the attack on Pearl Harbor.

But can you believe that the only significant event it can remember for July 20th, 1969 is the birth of a minor TV actor Josh Holloway? That most glorious day in human technological achievement and it can only find some face-of-the-moment? AIIGGGHH!!!!!!!!!!!!!

Monday, May 11, 2009

Gene Tests Don't Blow Up!

Today's Globe has a profile of the do-it-yourself genetic testing experiment that my former colleague Kay Aull is performing. Among the people quoted is yours truly.

Okay, it's really cool. I did once get a mention with several sentences in Newsweek (with a very distressed Mickey Mouse on the cover) but this time I got several column inches. However, after I gave the phone interview I came down with a small case of the worries. What if I was misquoted? Worse, what if I was correctly quoted but pulled a Watson? Luckily, what made it in fails to induce embarrassment, though there are bits which I wish hadn't been left out.

The article is well worth reading (though it may become a pay article overnight; I forget the current policy). With luck the wire services & aggregators will pick up on it.

I think anyone interested in genetic testing, DIY-bio, or just science in general should skim the comments thread. There's a lot there to be worried about.

First, a running theme is a worry that Kay will blow up her block or some such, voiced by multiple posters, many claiming to work in labs. Now, as Kay's comment (which is nice and level-headed, as I would have expected) points out, she's not using anything liable to do anything like that. For the level of ethanol precipitation she's doing, a fifth of vodka would last quite a long time (an interesting experiment; I remember the Russians are said to have built lasers with the stuff).

A second class of fear is other sorts of toxins, primarily the spectre of ethidium bromide (a known carcinogen) as a DNA stain. There are other, much safer stains, and it turns out that's what Kay is using.

Another general negative sentiment is that perhaps the city or her landlord should (or might) shut this down. I'm no lawyer, but this certainly wasn't obviously prohibited by any of my lease agreements. Putting household cleaners in the public's hands (or solvents in the form of nail polish or paint removers) scares me far more than a little PCR.

One more sentiment worth noting: that this sort of thing should be done only in an official laboratory and that Kay shouldn't do this without getting a master's or Ph.D. first. I suspect that these posters aren't aware that many of the same techniques are available in the toy section of any Target or Wal-Mart. True, none of those offer PCR -- but they easily could. PCR can be run without any special gear, though it would be awfully tedious. They are probably also unaware of modern scientists who worked without Ph.D.s (e.g. Nobelist Gertrude Elion) or in home labs (e.g. Nobelist Rita Levi-Montalcini).

On the other end of things, some of the positive posters are a bit worrisome. One makes the quite apropos comparison of this to having a home darkroom, but gets their chemicals confused -- while the stop solution is indeed just acetic acid, the fixer is not "drinkable but dull" but rather cyanide-based (cyanide is a great remover of silver, which is the job of the fixer).

There are also a number of posters who suggest that this information might be used against her by an insurance company or that it would be illegal to withhold it from same. Whether this would be prohibited by GINA isn't considered; I'm guessing the posters aren't familiar with it. Another poster relishes the idea that
Perhaps she objects to the greed of her peers at Harvard who are charging people for the opportunity to get similar bio data - See http://www.genomeweb.com/blog/round-100.
-- which is bizarre, given that the very GenomeWeb article mentions that these tests are free to participants!

Regardless of how poorly informed or quick to leap to conclusions some of these folks are, this is indeed the landscape of public opinion, at least as plumbed by response to this article. It would suggest that there is a lot of educating to do & that it will be an uphill battle. To a lot of people, science means formal labs and formal training and labs mean dangerous chemicals that might explode.

Sunday, May 03, 2009

The New Gig

I've always been a fan of the space program and I like movies, so when a movie astronaut speaks I listen. Since Beyond Genomics changed its name to BG Medicine, I can only interpret the advice as directing me to Infinity Pharmaceuticals.

Seriously, tomorrow I start at Infinity. Infinity has a number of anti-cancer programs which it is exciting to be joining. Of course, having drugs in the clinic can be a rocky ride; the day I agreed to go was the day a clinical trial was halted, and Infinity's stock fell 30% (or does somebody on Wall Street just not like me?)

Strange but true story: The day of my interview, a new Netflix disc was scheduled to arrive. The title: Infinity. Spooky!

As far as this space, there will probably be some subtle shifts. I'm probably a little too careful about not posting directly around where I'm working, but that is my habit and so areas such as cancer genomics may see less action. Infinity, as mentioned above, is public & so one must follow certain rules.

On the other hand, that still leaves a lot of biology to comment on. I probably will mine more of synthetic biology, a lot of genomics/proteomics/younameitomics and evolution. Computational stuff I'm working on -- plus some old interests that were lit anew during my time out. Plus some of my learnings from that time, where I set up and then dismantled a trans-Pacific consulting empire (yep! often had to cross Pacific Street to go from one client to another).

Tuesday, April 21, 2009

Is Codon Optimization Bunk?

There is a very interesting paper in Science from a week ago which hearkens back to my gene synthesis days at Codon. But first, some background.

The genetic code (at first approximation) uses 64 codons to encode 21 different signals; hence there are some choices as to which codon to use. Amino acids and stop can have 1,2,3,4 or 6 codons in the standard scheme of things. But, those codons are rarely used with equal frequency. Leucine, for example, has 6 codons and some are rarely used and others often. Which codons are preferred and disfavored, and the degree to which this is true, depends on the organism. In the extreme, a codon can actually go so out of favor it goes extinct & can no longer be used, and sometimes it is later reassigned to something else; hence some of the more tidy codes in certain organisms.

A further observation is that the more favored codons correspond to more abundant tRNAs and less favored ones to less abundant tRNAs. Furthermore, highly expressed genes are often rich in favored codons and lowly expressed ones much more likely to use rare ones. To complete the picture, in organisms such as E.coli there are genes which don't seem to follow the usual pattern -- and these are often associated with mobile elements and phage or have other suggestions that they may be recent acquisitions from another species.

A practical application of this is to codon optimize genes. If you are having a gene built to express a protein in a foreign host, then it would seem apropos to adjust the codon usage to the local dialect, which usually still leaves plenty of room to accommodate other wishes (such as avoiding the recognition sites for specific restriction enzymes). There are at least four major schemes for doing this, with different gene synthesis vendors preferring one or another:

  • CAI Maximization. CAI is a measure of usage of preferred codons; this strategy tries to maximize the statistic by using the most preferred codons. Logic: if these are the most preferred codons, and highly expressed genes are rich in them, why not do the same?

  • Codon sampling. This strategy (which is what Codon Devices offered) samples from a set of codons with probabilities proportional to their usage in the organism, after first zeroing out the very rare codons and renormalizing the table (see the sketch just after this list). Logic: avoid the rare ones, but don't hammer the better ones either; balance is always good.

  • Dicodon optimization. In addition to codons showing preferences, there's also a pattern by which adjacent codons pair slightly non-randomly. One particular example: very rare codons are very unlikely to be followed by another very rare codon. Logic: an even better approach to "when in Rome..." than either of the two above.

  • Codon frequency matching. Roughly, this means look at the native mRNA and its use of codons and ape this in the target species; a codon which is rare in the native host should be replaced with one rare in the target. Logic: some rare codons may just help fold things properly.
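
Here's a minimal sketch of the codon-sampling scheme mentioned above; the usage fractions are toy numbers for illustration, not a real E.coli table.

```python
import random

# Toy usage fractions for two amino acids (illustrative only, not a real usage table)
USAGE = {
    "L": {"CTG": 0.50, "TTA": 0.13, "TTG": 0.13, "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
    "K": {"AAA": 0.76, "AAG": 0.24},
}

def sampling_table(usage, min_frac=0.10):
    """Zero out very rare codons, then renormalize what's left."""
    table = {}
    for aa, codons in usage.items():
        kept = {c: f for c, f in codons.items() if f >= min_frac}
        total = sum(kept.values())
        table[aa] = {c: f / total for c, f in kept.items()}
    return table

def back_translate(protein, table):
    """Draw a codon for each residue with probability proportional to (renormalized) usage."""
    picks = []
    for aa in protein:
        codons, weights = zip(*table[aa].items())
        picks.append(random.choices(codons, weights=weights)[0])
    return "".join(picks)

print(back_translate("LKLK", sampling_table(USAGE)))
```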


A related strategy worth mentioning is the use of special expression strains which express extra copies of the rare tRNAs.

There is a lot of literature on codon optimization, and most of it suffers from the same flaw. Most papers describe taking one ORF, re-synthesizing it with a particular optimization scheme, and then comparing the two. One problem with this is the small N and the potential for publication bias (do people publish less frequently when this fails to work?). Furthermore, it could well be that the resynthesized design changed something else, and the codon optimization is really unimportant. A few papers deviate from this plan & there has been a hint from the structural genomics community of surveying their data (as they often codon optimized), but systematic studies aren't common.

Now in Science comes the sort of paper that starts to be systematic:

Coding-Sequence Determinants of Gene Expression in Escherichia coli
Grzegorz Kudla, Andrew W. Murray, David Tollervey, and Joshua B. Plotkin
Science 10 April 2009: 255-258.


In short, they generated a library of GFP variants in which the particular codon used was varied randomly and then expressed these from a standard sort of expression vector in E.coli. The summary of their results is that codon usage didn't correlate with GFP brightness (expression), but that the key factor is avoidance of secondary structure near the beginning of the ORF.

It's a good approach, but one question is how general the result is. Is GFP a special protein in some way? Why do the rare tRNA-expressing strains sometimes help with protein expression? And most importantly, does this apply broadly or is it specific to E.coli and relatives?

This last point is important in the context of certain projects. E.coli and Saccharomyces have their codon preferences, but if you want to see an extreme preference, look at Streptomyces and its kin. These are important producers of antibiotics and other natural product medications, and it turns out that the codon usage table is easy to remember: just use G or C in the 3rd position. In one species I looked at, around 95% of all codons followed that rule.
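Checking how closely an ORF follows that rule is a trivial calculation; here's a tiny sketch (the example fragment is made up):

```python
def gc3(orf):
    """Fraction of codons with G or C in the third position (GC3)."""
    thirds = orf.upper()[2::3]  # every third base, starting from the last base of codon 1
    return sum(base in "GC" for base in thirds) / len(thirds)

# Hypothetical Streptomyces-style ORF fragment, invented for illustration
print(round(gc3("ATGGCCGACCTCGCGGTGACCGGC"), 2))
```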

This has the effect of making the G+C content of the entire ORF quite high, which engenders further problems. High G+C DNA can be difficult to assemble (or amplify) via PCR and it sequences badly. Furthermore, such a limited choice of codons means that anything resembling a repeat at the protein level will create a repeat at the DNA level, and even very short repeats can be problematic for gene synthesis. Long runs of G's can also be problematic for oligonucleotide synthesizers (or so I've been told). From a company's perspective, this is also a problem because customers don't really care about it and don't understand why you price some genes higher than others.

So, would the same strategy work in Streptomyces? If so, one could avoid synthesizing hyper-G+C genes and go with more balanced ones, reducing costs and the time to produce the genes. But, someone would need to make the leap and repeat the Kudla et al strategy in some of these target organisms.

Wednesday, April 15, 2009

Sequencing's getting so cheap...

Here's a decidedly odd gedanken experiment which illustrates what next-gen sequencing is doing to the cost.

A common way of deriving the complete sequence of a large clone is shotgun sequencing -- the clone is fragmented randomly into lots of little fragments. With conventional (Sanger) sequencing these fragments are cloned, clones are picked and each clone sequenced. By using a universal primer (or more likely primer pair; one read from each end), a lot of data can be generated cheaply.

If you search online for DNA sequencing, a common advertised cost is $3.50 per Sanger read. This probably doesn't include clone picking or library construction, but we'll ignore that. Read lengths vary, but to keep the math simple let's say we average 500 nucleotide reads, which from my experience is not unreasonable, though very good operations will routinely get longer reads.

So, at that price and read length it's $7.00 per kilobase of raw data. For shotgunning, collecting 10X-20X coverage is quite common and likely to give a reasonable final assembly, though higher is always better. At 10X coverage, that means for each 1Kb of original clone we'll spend $70.00.

Suppose we have an old cosmid -- which is about 50Kb of DNA including the vector. So to shotgun sequence it with Sanger sequencing, if building & picking the library were free, would cost around $5,250 for 15X coverage. Pretty cheap, right?

Except, for a measly $4700 you can have next gen sequencing of it (and that actually includes library construction costs). 680Mb of next gen sequencing -- or over 13,000X coverage. Indeed, if you left the E.coli host DNA in you'd still have well in excess of 100X coverage of E.coli plus your cosmid. So if you had multiple cosmids, you could actually get them sequenced for the same price, assuming you can distinguish them at the end (or they just assemble together anyway)!

Sequencing so cheap you can theoretically afford 99% contamination! Yikes!
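For anyone who wants to fiddle with the assumptions, here's the same back-of-the-envelope arithmetic as a quick Python sketch; the prices and read length are the round numbers quoted above, not anyone's actual price list.

```python
# Round numbers from the text, not real vendor quotes
SANGER_PER_READ = 3.50        # $ per read
READ_LEN = 500                # nt, a convenient average
COSMID = 50_000               # nt, insert plus vector
ECOLI_HOST = 4_600_000        # nt, roughly an E.coli genome
NEXTGEN_PRICE = 4_700         # $ for ~680 Mb, library construction included
NEXTGEN_YIELD = 680_000_000   # nt

def sanger_cost(target_len, coverage):
    reads = target_len * coverage / READ_LEN
    return reads * SANGER_PER_READ

print(f"Sanger shotgun of the cosmid at 15X: ${sanger_cost(COSMID, 15):,.0f}")
print(f"Next-gen run: ${NEXTGEN_PRICE:,} for ~{NEXTGEN_YIELD / COSMID:,.0f}X of the cosmid alone")
print(f"With the E.coli host left in: ~{NEXTGEN_YIELD / (COSMID + ECOLI_HOST):,.0f}X of everything")
```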

Of course, it's unlikely you'd really want to be so profligate. Rather than resequence E.coli, you could pack a lot of inserts in. But it does underline why Sanger sequencing is quickly being relegated to a few niches (for example, when you need to screen clones in synthetic biology projects) & the price of used capillary sequencers is reputed to be going south of $30K.

Sunday, April 05, 2009

Two Myeloma Patients

TNG and I closed out the ski season a week ago. It's some great time together, but it also ends up being at times a bit of a solitary activity, leaving lots of time to think. Sometimes it's when he's in a lesson, but in general skiing is contemplative for me. It needs to be; if I think too hard about my technique I end up crashing spectacularly. I guess when it comes to skiing, I'm a Taoist.

Ideally, I'm thinking about beautiful scenery or admiring TNG's developing technique. But other thoughts invariably intrude, and more than a few times I find myself pondering multiple myeloma, as on a ski trip last year I met the second myeloma patient I ever knew.

For the last several years at Millennium, myeloma occupied a lot of my time. Because myeloma was the first disease where Millennium found success, this was natural. It was also two pronged. One goal was to better understand Velcade in myeloma to further develop the drug in that disease, such as going for first line treatment. But it was also seen as an important opportunity to learn how the drug works, so that intelligent decisions could be made about other cancers.

At quarterly company meetings there were often myeloma patients onstage to tell their story. One that particularly stuck in my mind was an oncology nurse who developed the disease, tried Velcade and almost immediately switched to something else; she experienced the full brunt of peripheral neuropathy while on Velcade and couldn't tolerate it. In some ways this seems like a curious choice to inspire your troops, but it did exactly that. We had done good things, but needed to do better. And most people came out of those meetings pretty charged up.

However, these were big presentations on stage, not face-to-face meetings. Even though I occasionally got to rub shoulders with some of the clinical giants of the field, I never met any patients. Not surprising, but somewhat noteworthy.

Last year we were away in New Hampshire for a ski weekend & I struck up a conversation with a group in the lobby. Somehow, it arose that one of their number had cancer, and I couldn't help but ask what sort & it turned out it was myeloma. As is common, someone who should have been enjoying their golden years was instead faced with this dread disease.

Myeloma most commonly strikes late in life. It arises in most, if not all, cases when a DNA rearrangement occurs within a cell which creates antibodies. Certain rearrangements are necessary for the correct creation of antibodies; these alterations lie at the heart of the system for creating a wide array of antibodies to defend against a wide array of invaders. But sometimes the cut-and-paste glues the wrong two things together, and that can drive a myeloma. Perhaps the disease shows up late in life because the switching machinery loses its edge as life goes on, or perhaps it is just that eventually the wrong number comes up on the immunologic dice.

My chance meeting in that lobby was particularly poignant as it had not been long before that I had met my first myeloma patient, and that was no random stranger. Every year growing up the family would travel west to see my grandparents in Kentucky, and in one direction or the other we would stop by my aunt and uncle in Ohio. My cousins are much older than I, so it was often just my aunt & uncle and my family. With no children to play with, I didn't play a lot of board games there. But I had a lot of fun, as my uncle took me to the Reds or his garden patch or to see a train. He'd murder me in croquet. He took me to the print shop at his high school & showed me how to print up a bunch of notepads. In later years, I'd feel humble after failing to explain to him what I did for a living, realizing I had slipped deep into the land of jargon. And he'd try to convince me that no bumpkin from Avon could have written those plays; much more likely they came from the Earl of Oxford.

Eventually, I flew the nest and I no longer saw them on an annual schedule, but he never missed a family wedding and I even made it to one family reunion. I'd avidly read his Christmas letter to catch up with the rest of the clan. Of course, you couldn't believe everything in it, as he was a notorious prankster. Yes, those birthday checks with the crazy name were real ("Fifth Third Bank" -- who's going to believe that?), but he had not been truthful about his WW2 service -- the Army probably doesn't even have dedicated mess kit repair units. No, he actually was a decorated signalman. Only once did he tell me a story that didn't happen stateside; it is a source of more than a little guilt for me that I can't remember any details. It wasn't that I wasn't listening, but somehow it didn't stick.

So it really hit home when I found out that this great man, who had given so much to me and others (he was recorded weekly reading for the blind) had been diagnosed with myeloma. It seemed a bit ironic that now that I had a strong personal motivation, I was no longer working in the field. But I did have a long phone chat with him & tried to be useful, though he had been well briefed by his doctor and there wasn't a lot for me to do. I mentioned things like stem cell transplants, and he remarked that he was eighty four, and while he wasn't going to give up there were limits to what he would do; life quality was important.

A goal of modern oncology is to have a patient die with their disease, not of their disease. I do not know how to score this case. About a month and a half before our ski trip a cerebral hemorrhage felled my uncle. Was this myeloma's fault? Thalidomide's? Or a not unlikely result for an elderly American in generally good shape? We cannot cheat death forever, and something must end life. On the other hand, in no way could myeloma be given a free pass -- it certainly gave him undeserved misery near the end.

About a month and a half after the ski trip, I attended a very nice memorial service for him, where dozens of his former students turned out to testify how he had changed their lives. We learned things we never knew about him (he played the tuba?) and remembered the good times.

Whenever I think about myeloma now, I can't help but remember him. I also remember that patient I met in the hotel, and sometimes I still can feel the wetness of his parting friendly gesture on my hand. I didn't ask what medication he was on, but I can assume it wasn't Velcade or Revlimid. Might he have been on thalidomide? If so, do standard poodles need to go through STEPS?

Saturday, April 04, 2009

Too Many Closings

These are dire economic times, with the signs all around. In the town where I live, several stores have closed in the town center -- and my favorite imported goodies store has frighteningly bare shelves & a nearly empty cheese cooler. I fear the worst.

On a much bigger scale, today's Boston Globe carried a headline that the same newspaper may close unless significant labor concessions are made by its unions, confirming previous speculation that the Globe was hemorrhaging money from its owner the New York Times. This week marked another round of cutbacks in the newsroom, and it seems about every 6 months or so another redesign occurs to attempt to hide (and cope with) the shrinking number of pages.

One of those redesigns has been the elimination of a separate business section, with the business section instead contained within the Metro section -- they are physically, but not logically, merged. And as many may know, yesterday's paper had an obituary for my recent employer, which also noted the recent or imminent demise of several other biotechs.

It should be noted that the Globe seems a tad slow on the news. Codon, of course, unloaded the majority of its staff two weeks ago. Okay, nobody squealed loudly. It is a bit more striking that the Globe article stated that the ultimatum to the unions had been delivered Thursday -- how could nobody at the newspaper have been tipped off to that!

The possibility of losing the Globe is very sad to me, as I truly have newspaper in my blood. No, I don't mean my family has a history of careers in the industry (though we do seem to dabble in it); I mean I've been reading the newspaper since I can remember, so I've certainly assimilated a good deal of it into my cellular structures! I too dabbled in the industry, delivering for one paper (which ended operations shortly after I quit) and doing high school sports photography and reporting for two others (one of which also appears to be bust). I also edited my high school's newspaper, so I can lay a tiny claim to once being an ink-stained wretch (is it possible to be stained with bits?). All through college and beyond, I've always had a subscription to the daily paper. While various deficiencies in local delivery have in recent years tested my loyalty, I still subscribe. Perhaps not much longer -- but not by my choice.

I do want to allay any concerns that this particular enterprise might be headed for a similar fate. Fear not, dear readers! While revenue has stayed completely flat, in these times that must be considered an accomplishment. The Omics! Omics! balance sheet remains out of the red -- as it always has. And just think -- any future revenue would mean infinite revenue growth!

Wednesday, April 01, 2009

New DNA Service Makes Dates -- Via Dogs!

Ever notice how a couple sometimes resemble the family pet? A new startup company believes that this is the secret to dating success, and that DNA typing is a way to guarantee romantic bliss.

Date My Dog's DNA will test both you and your dog's DNA and then apply proprietary computer algorithms to find your perfect match. Dogless individuals can also be typed, though they will only be matched with someone who has registered their dog in the service.

Why should this work? According to President and CEO Jack Russell, our choice of dog is driven by fundamental personality traits. By examining the DNA, traits can be matched between dog and human. "While it is useful for purebred canines, the real power comes with mixed breeds, as you may not realize which tendencies you are keying in to", says Russell. "Just imagine", he continues, "all the painful breakups due to date-dog incompatibilities; we believe we can prevent most of these".

Can the technology be put to other uses? Vice President for Marketing K. Charles Cavalier suggests that once pre-conception DNA screening becomes routine, they plan to move into this area. Would this be eugenics hidden behind a wagging tail? Replies Cavalier: "We think each couple will choose very differently. For example, if you have two border collies you might enjoy a bright but hyperactive child. On the other hand, if you have a bloodhound you might prefer a quiet, contemplative child who likes to observe the world." Continues Cavalier "We think parent-child bonding is critical to a child's mental and social development. You've already bonded with your dog; why not leverage that bond into a better one with your child?"

Seed funding for the company has been provided by the Kaltnassnase Fund.

Tuesday, March 24, 2009

Codon's Type IIS Meganuclease

When I joined Codon Devices, I swore I would not use this space to shamelessly tout any results from the company. It turned out my resolve was never tested. It's not that there weren't interesting results being generated in the company, but that in one way or another they never became public. Some results were never meant to be public, but were within collaborations, whereas some intended to be public got held up by one snag or another.

Perhaps the universe does like to play subtle jokes on us. Now that I'm out, so is the first publication from the company, describing the engineering of a Type IIS restriction enzyme with a very large recognition sequence.

Type IIS restriction endonucleases are handy for many purposes, but particularly for gene construction techniques. Whereas most restriction enzymes recognize and cut at the same site, Type IIS enzymes recognize a specific site but then cut a precise distance away (or cut at perhaps two different offsets; note Fig 2 of this reference). This is handy because it allows one to design two pieces to come together (via the sticky overhangs generated by the enzyme) but without the recognition sequence in the final product. Hence, Type IIS enzymes can allow virtually any sequence to be built.
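As a toy illustration of that geometry, here's a minimal Python sketch using FokI-style numbers (recognition site GGATG, top-strand cut 9 nt downstream, 4-nt 5' overhang); the substrate is a made-up fragment, and real construction design of course has to track both strands, orientation, and the partner fragment.

```python
SITE, TOP_OFFSET, OVERHANG = "GGATG", 9, 4   # FokI-style geometry: GGATG(9/13)

def type_iis_cut(seq):
    """Return (left piece, sticky overhang, right piece) for the first site found.

    The cut lands downstream of the recognition site, so the overhang is whatever
    sequence happens to sit there; that is why any junction can be designed.
    """
    i = seq.find(SITE)
    if i < 0:
        return None
    cut = i + len(SITE) + TOP_OFFSET
    return seq[:cut], seq[cut:cut + OVERHANG], seq[cut + OVERHANG:]

# Made-up fragment: the designed overhang "AATT" sits 9 nt past the GGATG site
left, sticky, right = type_iis_cut("ACGTGGATGTTTTTTTTTAATTCCCGGGAAACCC")
print(left, sticky, right)   # the sticky end carries no trace of the recognition site
```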

The catch, of course, is that it is challenging to build in this fashion a sequence which itself contains the Type IIS recognition sequence. Ideally, these sequences would be very long and hence unlikely to appear by chance. Unfortunately, the known Type IIS enzymes almost all have 5 or 6 basepair long recognition sequences, which are not terribly rare once you get in the multiple kilobase range, and are certainly not rare if you want to build chromosome-sized DNA.
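To put rough numbers on "not terribly rare": in random sequence of even base composition, a specific site of effective length k is expected about once every 4^k bases. A quick sketch (ignoring strand and palindrome subtleties, and using 5 Mb as a convenient genome-ish size):

```python
def expected_sites(k_effective, target_len):
    """Expected occurrences of a site with effective length k in random 50% GC DNA."""
    return target_len / 4 ** k_effective

for k in (6, 12):
    print(f"effective {k} bp site: ~{expected_sites(k, 5_000_000):.2f} expected hits in 5 Mb")
```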

So the goal of a number of efforts has been to build a Type IIS restriction enzyme which has a very long recognition sequence. Enzymes called homing endonucleases have huge recognition sequences, with effective lengths of 12 or more basepairs (the actual lengths are greater, but there are also some positions which are not fully fixed to a particular nucleotide -- hence the term effective length). The advance of Lippow et al is that a new level of precision was obtained in the cutting sites, a level of precision compatible with gene engineering.

In a sense, the problem is analogous to that of a K9 unit. The handler has a potentially vicious dog which she would like to apply precisely. Give the dog too short a leash and you can't deploy its teeth; give it too long a leash and the teeth may sink into places other than where you want them to.

So what Lippow et al did is build different protein linkers to tie the DNA recognition domain (handler) to the cleavage domain (dog) from the Type IIS enzyme FokI. By run-off Sanger sequencing, in which the polymerase is allowed to extend to the end of a DNA strand, they showed that cutting is precise, particularly for one of the specific enzymes generated. The dog, alas, is not under complete control; some random off-site cutting is observed. But it is a step forward.

One last hitch: to be particularly useful, one really needs at least two Type IIS meganucleases, and ideally many. Alas, this paper provides only one -- but it is a roadmap to building more, as there are a number of other homing endonucleases which could potentially be used for recognition modules. Alternatively, a number of papers have generated Sce-I variants with different recognition specificities, so introducing these mutations into the CdnI enzyme reported here should allow a new set of Type IIS meganuclease specificities.

Monday, March 23, 2009

JAK2 haplotype promotes JAK2 mutation

An interesting trio (Klipivaara et al, Jones et al, Olcaydu et al) of abstracts from the Nature Genetics Advance Online Publications site (alas, I don't have fulltext access without traipsing in to MIT or Harvard to use the library, but more on that soon).

JAK2 (Janus Kinase 2) is a protein kinase important in hematopoietic cell function, and a particular mutation was shown several years ago to result in several distinct but related myeloproliferative disorders.

In these papers, particular haplotypes (given only the abstracts, it's impossible to determine if there is complete agreement on which ones) lead to a higher risk of the disease-causing V617F mutation. What is quite striking is that the mutation occurs in cis to the haplotype, that is to say the same chromosome with the haplotype tends to be the one bearing the mutation.

The explanation favored by the papers appears to be that the haplotype somehow creates a favorable DNA context for causing the mutation. If the mutations showed up in trans (on the other chromosome) just as often, one might contemplate a mechanism whereby the haplotype somehow increases the selective advantage of V617F -- perhaps, for example, by causing incorrect JAK2 expression.

It will be fascinating to see this story play out -- what DNA mutational or repair mechanism does the haplotype tip the balance of? And, now that this is precedented you can be sure there will be a lot of searching for other examples. A quick screen would be to look for mutational haplotypes which contain known oncogenic mutations, and then go screening somatic samples for those haplotypes. Of course, with sequencing getting so cheap, the not too distant future will have lots of paired germline and tumor complete genomes to compare.

Friday, March 20, 2009

TGA Codon


Well, after a few successful readthroughs I've been hit with my career's Release Factor again.

Looking on the bright side, this will give me time & focus to write here and to tackle two invited articles.

I'm also entertaining short-term consulting gigs in the Boston area (or, with travel expenses included, in cities with resident Ailuropoda melanoleuca :-) But that's just a stop-gap; what I'd really like is a permanent position to again tackle interesting scientific questions at the interface between biology and computing.

Wednesday, March 18, 2009

One helix to teach them all, and in the taxonomy bind them?

I originally saw this last summer in some free tourist guide, and neglected to write on it, but a little googling verified my memory. There is a game show on one of the channels now called "Are you smarter than a 5th grader", in which adults go up against 5th graders in a quiz show format, with the questions supposedly representative of that level of elementary school. When I saw this particular item, my eyes rolled at first but then I pondered some more -- and realized that while I'd probably stick to my original position, it is a bit more nuanced than my first reaction.

Name 3 of the 5 kingdoms.


Okay, this was enough to generate an autonomic response. Back in high school we probably spent a good chunk of a class going over various Kingdom proposals. I don't have that textbook, but one of a similar vintage would be my freshman bio textbook, Biological Science by Keeton & Gould, 4th Edition. K&G (p.1019) outlines eight different kingdom systems, ranging from 2 to 8 kingdoms.

Now, of course, one must ask what exactly is a kingdom? Ideally a kingdom would consist of a bunch of organisms with a common theme (which wouldn't be simply the lack of all the themes of other kingdoms), all organisms with that theme would be in the kingdom, and no extant organism outside that kingdom would trace its ancestry to a member of that kingdom. At least, off the cuff, that is the definition I would give.

So which one induced a reflex? It is the five kingdom system: Plant, Fungi, Protists, Animals & Monera, which it turns out is the one Keeton & Gould used for organizing their survey of the living world.

Now, it isn't an awful system, particularly back in the late '80s when I had it. Monera are all the single-celled thingies which lack a nucleus. Eukaryotes are what we know best, so they are subdivided into single celled (Protists), multi-cellular with cell walls & photosynthesis (Plants), multi-cellular, with cell walls but never photosynthetic (Fungi) and multi-cellular with no walls (Animals).

In that era, issues with these groupings were certainly recognized and taught. Yeasts clearly were related to Fungi, so they went there despite unicellularity. Some plants lack photosynthesis (e.g. dodder), but clearly this is a late loss and they belong in Plants. Protists is a handy way to lasso all sorts of traditional problems such as Euglena, which both photosynthesizes and moves.

But, what was just emerging when I was taught these things, but is now quite evident, is that the non-nucleated world is really two worlds, Eubacteria and Archaea. While they both have many similarities (such as mostly circular chromosomes), they are very, very different in other fundamental cellular processes, such as RNA transcription. Plus, now we have DNA & RNA phylogenetic methods which show them to have diverged very long ago.

There are other issues DNA methods have illuminated. Protists are not an evolutionarily coherent group but are instead a mishmash of various lineages ("polyphyletic"). Eukaryotes as a whole don't fit a simple tree lineage, due to multiple endosymbiont captures resulting in organelles such as mitochondria and chloroplasts (and perhaps more).

Which raises the question: what should we be teaching 5th graders? My reflex reaction is that we shouldn't teach them things they'll need to unlearn later, and the Monera kingdom concept is just not a very good one in the light of molecular phylogenies. But, what my further pondering brought up is that one goal of science education is to teach students the methods of science rather than just rote facts. Given a microscope or some photographs, it is pretty easy to teach a young student how to classify organisms into the 5 kingdom system. Trying to explain why archaea and eubacteria should be in different groups isn't so easy. Okay, a lot of archaea have pretty weird lifestyles (insanely low pH, even more insanely high heavy metal content, boiling water, etc), but not all do. Just being strange to us isn't really a useful way to categorize.

On the other hand, perhaps at least the notion of molecular classification can be introduced early. Granted, it's an N of 1, but I've successfully shown that you can teach the concept to a 3rd grader. It's also something which can be easy to diagram out & count -- with (obviously!) only a subset of informative positions. And in the end, wouldn't that be the best science lesson of all -- that things which look superficially alike may have an underlying, nearly hidden great difference?
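For what it's worth, the counting exercise really can be that simple. Here's a toy sketch with invented "informative positions" for four organisms; the sequences are made up for the exercise, not real rRNA data.

```python
from itertools import combinations

# Invented alignments of a handful of informative positions (not real sequences)
seqs = {
    "E. coli":  "GATTACACGT",
    "Archaeon": "GACCTGATGT",
    "Yeast":    "AACCTGTTAC",
    "Human":    "AACCTGTTGC",
}

def differences(a, b):
    """Count positions where two equal-length aligned sequences disagree."""
    return sum(x != y for x, y in zip(a, b))

for (name1, s1), (name2, s2) in combinations(seqs.items(), 2):
    print(f"{name1:8s} vs {name2:8s}: {differences(s1, s2)} differences")
```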

Of course, the hardest part of any change is getting change. It appears that a generation of science teachers have been taught the 5 kingdom system, and so will need to be updated. Numerous textbooks probably also encapsulate this archaic (but not archean! :-) concept. Probably the hardest to change will be those statewide curriculum standards or standardized tests which contain these phylogenetic fossils.

Sunday, March 08, 2009

The next level in genomics term papers

I've been intrigued for a few months now since hearing about a St. Louis company called Cofactor Genomics. Right on their front webpage they advertise they will generate & assemble 680Mb of sequence (from an Illumina machine) for the paltry sum of $4.7K.

Wow! That would fit on my credit card when I was a graduate student (though it would have been a few months stipend). 680Mb is 100+X coverage of an E.coli-class genome, or about 50X coverage of Saccharomyces. It's even well over 0.5X coverage of an awful lot of interesting eukaryotes.

As an aside, I feel obligated to stress that I don't have any personal stake in, or direct relationship with, Cofactor Genomics. I also have no experience with them or any of their competitors. It's just the ease of accessing their pricing matrix makes them easy to talk about.

At those prices, the idea of doing my own personal genome project can't be easily shooed away. Not a Personal Genome Project -- I worry I'd develop genomania -- but some small genome sequenced on my whim. There's probably still no shortage of interesting genomes in species I could easily & safely grow up, given some forbearance from my shop's management or a friendly academic. There must be some left; there are even some industrially-interesting E.coli strains that seem to lack public sequences. However, even if it wouldn't violate my town's zoning laws to do it in my basement, neither growing biological samples nor the $5K budget would fly with my spouse.

So I'll float a different idea. My only wish is that anyone who tries it posts back here, and if you're already doing the same thing I invite your response as well. If I can't do it, why not some class?

Now $5K isn't chicken feed. I'm sure that is far beyond the typical budget for lab experiments in a college class, let alone a high school. Maybe a donor could step in, but these days donors are particularly tough to find. But suppose the cost were spread over a lot of students?

One scenario would be for a very large university to make this the project for an entire class. A really huge state school, I would guess, could have 500+ students a year taking first-year biology. Now we're talking less than $10 per student -- perhaps still a significant hit (what is a typical per-student budget for such a course?). Each student would get about 1/500th of the genome as their very own research project.

At a smaller school, could a genome project become a departmental initiative? A bioinformatics class could set up the analysis pipeline & develop reporting tools. Biochemistry class could map the ORFs to known biochemical pathways and identify both missing pathways and predicted novel (to the species) enzyme activities. Genetics classes could focus on operon structure or on identifying regions recently transferred horizontally from another species. Evolution classes could tackle that too, or build a bazillion gene trees. It's a bit of a stretch to work this into a human physiology curriculum, though a comparative look at how another biological system manages homeostasis isn't completely absurd.

Of course, when it comes time to publish it will be a very long author list!
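To give a flavor of what the bioinformatics class's end of that pipeline might start from, here is a minimal sketch of a naive ORF caller -- forward strand only, standard start and stop codons, and an invented test sequence. It's the sort of toy the class could extend with reverse-complement scanning and a real gene model.

    # Naive ORF finder: scan the three forward reading frames for ATG...stop
    # stretches above a length cutoff. A real pipeline would also scan the
    # reverse complement and use a proper gene-finding model.
    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def find_orfs(seq, min_codons=30):
        """Yield (start, end, frame) for ATG-initiated open reading frames."""
        seq = seq.upper()
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield start, i + 3, frame
                    start = None

    # Invented test sequence, just to show how it would be called.
    test = "ATG" + "GCT" * 40 + "TAA" + "CCGGA"
    for start, end, frame in find_orfs(test, min_codons=20):
        print(f"ORF in frame {frame}: {start}-{end} ({(end - start) // 3} codons)")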

I think I've heard of a genome project being run as an undergraduate effort, but I'm guessing a lot of that involved doing the actual sequencing. While there's merit to that, these days even with free labor, large-scale Sanger sequencing isn't cost competitive. Perhaps some departments have one of the next-gen machines & are willing to let some undergraduates play with them -- but I'm guessing that's pretty rare (like a NotI site in an AT-rich genome).

Will sequencing costs ever crash low enough that someone will sequence a genome for a grade school science fair project? I'm not holding my breath, but I certainly wouldn't rule it out.

Tuesday, March 03, 2009

MGH To Mutation-Type All Cancer Patients

Today's Boston Globe carried a front page item that Massachusetts General Hospital is planning to screen all cancer patients for a battery of about 110 common cancer mutations in 13 genes. MGH is apparently the first hospital to go into this much depth for every patient.


This is an exciting push forward into personalized medicine, and it makes sense for a teaching hospital such as MGH to make the first leap. This sort of typing makes intuitive sense, but (as the article states) its clinical value remains to be proven. A few patients, such as one profiled in the piece, will see radical changes in treatment which benefit them -- in the example, a woman had a relatively rare kinase fusion (EML4-ALK) for which an investigational drug was available -- and she responded spectacularly. But for many patients, the mutations found won't change care because there isn't a known way to target their mutation spectrum.

But the huge value will be longer term, as MGH builds a database of mutations and responses to treatment -- such a database will almost certainly provide new ideas for treatment, ideas which a research-focused hospital will be willing & able to try out. As more mutations are linked to cancer outcomes and screening costs come down, surely the panel will be expanded. MGH is also presumably planning to screen patients both at initial diagnosis and after relapses, so an increasingly rich database of mutations appearing during cancer progression will emerge.

It will be interesting to see how many other hospitals here -- and elsewhere -- follow. Boston has a small herd of top-notch hospitals and most (if not all) have significant cancer centers (with one, Dana Farber, completely focused on the subject). Ideally the results of many such screens could be pooled into one or more common databases, with of course the need to protect patient confidentiality.

One barrier may be cost. The Globe article pegs it at $2000, and states it is unclear if insurers will pay -- in the past they have demanded proof of clinical value. While that isn't an indefensible position, it would be in their self-interest to chip in -- perhaps a prorated amount. First, it's lousy PR to not pay for diagnostics that are likely to work (and the drumbeat for single-payer is pretty much constant in the same paper). Second, the tests are likely to provide useful information some fraction of the time -- and in those cases may provide cost savings. MGH is apparently considering eating the cost or asking the patients to kick some in.

MGH may also be setting the price point for such services. $2K isn't far from the $4K for which Complete Genomics claims it will be able to run a complete genome in the not-too-distant future. $2K probably is already in the ballpark for sequencing off capture arrays.

Of course, budgets for diagnostics aren't infinite. Will such initiatives be knocking elbows with other genomics-driven diagnostics, such as the existing array-based assays (e.g. OncotypeDX, CupPrint)? Will greater value come from methylation profiling or other assays which evaluate markers not available to current sequencing technologies? Time will tell.

Saturday, February 28, 2009

Time to deal with the IRS (Internal Reaction Service)

It's that time of year again -- when those of us in the U.S. must deal with numbered forms and lettered schedules. In this light, I wish to share a recent piece of correspondence:

Dear Dr. Robison:

After great difficulty (must your handwriting be so atrocious?) I have reviewed the accounts at your business enterprise. I regret to inform you that two of your accounts, with ATP Corp and NAD(P)H Ltd, are grossly out of balance. While you are running a deficit with the former and a surplus with the latter, as we have discussed previously these separate accounts cannot be merged. Your enterprise is doomed to failure (and I think it goes without saying that some sort of Madoffian scheme will not be countenanced by me). You must bring these into balance promptly, never mind the horror of trying to explain this in an audit.

I realize I am not qualified to comment on the technical aspects of your effort. However, may I suggest you get out of the lab more and get some fresh air? Perhaps some oxygen would stimulate your activity in a most productive way?

Sincerely,

Colin Escherich, C.P.A.

Sunday, February 08, 2009

Any Genome Sequence You Want, As Long As It's Human

It's been interesting reading dispatches from bloggers Dan Koboldt and Daniel MacArthur, who are attending the Marco Island conference, the big yearly confab on bleeding edge sequencing technology. How have I resisted this conference for so long, especially with the climate as a draw?

One company that is again receiving a lot of attention is Complete Genomics, which is proposing to build a set of sequencing centers to sequence human genomes at $5K a pop. What is striking is that their business model is to sequence only human genomes and nothing else, which particularly surprised Daniel MacArthur at Genetic Future.

As a biologist and someone fascinated with all genomes, such a policy is not a welcome thought. But, as someone who has worked in an industrial high-throughput production facility, I think I can reverse engineer the logic pretty well (I have no connections to or inside information from the company).

Why would you want to do this? Simplicity. By focusing on only a single genome, all sorts of simplifications follow. Complexity costs significant money & time, and it is often what seems trivial that ends up being very costly. Just allowing a second genome in the door creates substantial additional work on the software side, and if that second source requires different sample prep, that's an additional headache on the lab side.

Having only one genome kicking around also creates some interesting opportunities for quality control both for each sample and for the whole factory (which is what they are talking about building: a sequencing factory). One genome means only one reference sequence to compare against & one set of pathological problems for their assembly algorithm to be fortified against. One genome also means that if you see another genome in your data, you know something is wrong -- and if you see the same one genome repeatedly you may have a factory-wide problem.
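To make that concrete, here is a minimal sketch of the kind of factory-level sanity check a one-genome shop might run -- the sample names, alignment rates, and tiny variant "fingerprints" are all invented, and a real pipeline would obviously work from full alignments and genotype panels rather than toy numbers.

    # Toy QC for a single-reference factory:
    #  1. reads that align poorly to the lone human reference hint at
    #     contamination or a non-human sample sneaking in;
    #  2. two supposedly different samples with near-identical variant
    #     fingerprints hint at a sample swap or a factory-wide tracking bug.
    from itertools import combinations

    samples = {
        "patient_A": {"align_rate": 0.97, "fingerprint": "AGTCAGGT"},
        "patient_B": {"align_rate": 0.55, "fingerprint": "AATCAGCT"},  # contaminated?
        "patient_C": {"align_rate": 0.96, "fingerprint": "AGTCAGGT"},  # same as A?
    }

    MIN_ALIGN_RATE = 0.90         # below this, suspect the sample isn't (all) human
    MAX_FINGERPRINT_MATCH = 0.95  # above this, suspect two samples are the same person

    for name, m in samples.items():
        if m["align_rate"] < MIN_ALIGN_RATE:
            print(f"WARNING: {name} aligns poorly ({m['align_rate']:.0%}) -- contamination?")

    for (n1, m1), (n2, m2) in combinations(samples.items(), 2):
        shared = sum(a == b for a, b in zip(m1["fingerprint"], m2["fingerprint"]))
        if shared / len(m1["fingerprint"]) > MAX_FINGERPRINT_MATCH:
            print(f"WARNING: {n1} and {n2} look like the same genome -- sample swap?")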

"Any color you want so long as it is black" got Ford to the top of the U.S. automotive heap, but it didn't keep them there -- I believe that GM's offering colors helped push them into first. So will the market support Complete's vision? I think it can.

Complete is apparently talking about running a million genomes per year. At $5K each, that would be $5 billion -- some serious cash flow. I don't know if they've estimated the market correctly, but it doesn't seem ridiculous. If a large fraction of the world's wealthy decide to sequence their genomes (and their children's too), and if sequencing tumors becomes semi-routine, a few million human genomes a year is not a crazy number. Of course, Complete would have to fight with all the other players for a share.

That raises a question: what comparable markets are they giving up? I'd love to see broader "zoonomics", where we go through the living world sequencing everything, but that's all going to be grant funded. Smaller genomes may also be completely mismatched with this sort of technology -- at least without some sort of multiplexing (complexity!). Similarly, it's hard to see a big commercial market for metagenomics -- it will remain fascinating & there's no end to the ecological niches to explore, but who in the private sector is going to pony up major money for it? Oncogenic mouse models will supply lots of tumors for sequencing, but again probably not as a big private sector activity.

The one area I can almost envision is sequencing valuable livestock or agricultural lines to understand their complete makeup. If this were done not only for parentals but for offspring in breeding programs, then perhaps a big market would be generated. But is it really worth sequencing to completion, or will some cheaper technology for skimming the surface suffice? If there is a market, then a logical business direction for Complete might be a joint venture or spinout focusing on alternate genomes -- but either the prize would need to be big or the one-genome business model would need to be failing for that to be worth diverting attention.