Amongst the news last week is a bit of a surprise: the salmon genome project is choosing Sanger sequencing for the first phase of the project. Alas, one needs a premium subscription to In Sequence, which I lack, so I can't read the full article. But the group has published (open access) a pilot study on some BACs, which concluded that 454 sequencing couldn't resolve a substantial chunk of the sequence, and so shorter-read technologies are presumably ruled out as well. A goal of the project is a high-quality reference sequence to serve as a benchmark for related fish, which sets the bar for quality very high.
This announcement is a jolt for anyone who has concluded that Sanger has been largely put out to pasture, confined to niches such as verifying clones and low-throughput projects. Despite the gaudy throughput of the next-gen sequencers, read length remains a problem. That hasn't stopped de novo assembly projects such as the giant panda genome from proceeding forward, but salmon is apparently even nastier when it comes to repeats.
Still playing the armchair next-gen sequencer (for the moment!), I find it an interesting gedanken experiment. Suppose you had a difficult genome you really, really wanted to sequence to a high-quality reference. On the one hand, Sanger sequencing is very well proven. On the other, it is more expensive per base than the newer technologies. Furthermore, Sanger is pretty much a mature technology, with little investment in further improvement. This is in contrast to the next-gen platforms, which are being pushed harder and harder by both the manufacturers and the more adventurous users. That push includes novel sequencing protocols to address difficult DNA, such as the recently published Long March technique (which I'm still wrapping my head around), which generates nested libraries for next-gen sequencing using a serial Type IIS digestion scheme. Complete Genomics has some trick for inserting multiple priming sites per circular DNA template. Plus, Pacific Biosciences has demonstrated really long reads on a next-gen platform -- but demonstrating is different from having it in production.
So it boils down to the key question: do you spend your resources on the tried-and-true but potentially pricey approach, or bet that emerging techniques and technologies can deliver the goods soon enough? Put another way, how critical is a high-quality reference sequence? Perhaps it would be better to generate very piecemeal drafts of multiple species now and then go for finishing the genomes when the new technologies come on line. But which experiments dependent on that high-quality reference would be put off a few years? And what if the new technologies don't deliver? Then you must fall back on Sanger, now quite a bit behind schedule.
It's not an easy call. Will salmon be the last Sanger genome? It all depends on whether the new approaches and platforms can really deliver -- and whether someone is daring enough to try them on a really challenging genome.
Sunday, June 21, 2009
Cancer Genome Sequencing--A (Pessimistic) Interim Analysis
The current issue of Cancer Research carries a very brief (3 pages, with one page mostly tables & figures) review of the first pulse of cancer genome sequencing papers (subscription required to read the article). While subtitled 'An Interim Analysis', perhaps a better subtitle would be 'A Uniformly Negative Analysis'.
A full-press cancer genomics project has been a controversial drive, with many bemoaning the huge amount of resources devoted to it and believing other avenues would be better suited to enhancing our ability to help cancer patients. But it has gone forward, and a spate of papers over the last year has reported the early results.
The initial papers have covered four of the big cancers in terms of incidence and mortality (lung, breast, colorectal and pancreatic) as well as glioblastoma and leukemia. Different studies have taken different tacks. In leukemia, we have the first parallel complete sequencing of a patient and their tumor. Papers in breast, colorectal (together covered in two papers here and here), pancreatic and glioblastoma looked at huge numbers of coding exons in small numbers of patients (11 patients x 18.2K genes for breast and colorectal; 21 patients x 20.6K genes for glioblastoma; 24 patients x 20.6K genes for pancreatic). A lung paper and the other glioblastoma paper looked at ~600 genes, but in larger numbers of patients (188 in lung and 91 in glioblastoma).
Personally, I would take a more nuanced view of the results. I think it is hard to argue that these papers have had a shortage of fireworks: there have been some important observations made, which curiously the Cancer Research review ignores completely. In the lung study (which I have studied the closest), these include important exclusion and cooperativity relationships between mutations and a number of novel, druggable candidate driver genes (protein kinases) not previously suspected in lung cancer. In the many-genes, few-patients glioblastoma study, it was the identification of a mutational hotspot in isocitrate dehydrogenase 1 (later found to be present, though less frequently mutated, in isocitrate dehydrogenase 2).
Of course, one thing which is changing rapidly is the cost of doing these studies. Most of these papers used conventional PCR amplification and Sanger sequencing, which I would lowball estimate at $1/well (very lowball -- Sandra Porter caught some serious flak for suggesting, as I have, a number much higher than this for the sequencing part, and I don't have the accounting experience to argue, but I do know people who calculated it at Codon and this would be a very low estimate) -- so those studies looking at nearly every coding exon ran at least a quarter million dollars per patient (those 20+K genes explode out to about a quarter million exons). Clearly this isn't how things will tend to be done going forward; Illumina will now blow away genomes for $48K each and other companies are now quoting even lower. That is still well in excess of the per-patient cost of the very focused studies, and I believe these (particularly the lung study) demonstrate the value of lots of patients, since the larger cohorts started to give the numbers required to look at interactions between mutations.
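To make that back-of-envelope arithmetic concrete, here is a minimal sketch (mine, not from the review or the papers) in Python. It uses only the rough figures quoted above -- ~20K genes exploding to ~a quarter million exons (so an assumed ~12.5 exons per gene), my deliberately lowball $1/well for PCR plus Sanger, and the ~$48K Illumina whole-genome quote -- all of which are loose estimates rather than measured costs.

```python
# Back-of-envelope comparison: exon-by-exon PCR/Sanger survey vs. one
# whole-genome Illumina run. All figures are the rough estimates quoted
# in the post, not actual accounting.

GENES = 20_000            # ~20K protein-coding genes surveyed
EXONS_PER_GENE = 12.5     # assumed average, chosen so the total is ~250K exons
COST_PER_WELL = 1.00      # deliberately lowball $ per PCR/Sanger well
ILLUMINA_GENOME = 48_000  # quoted whole-genome price per sample ($)

exons = GENES * EXONS_PER_GENE
sanger_per_patient = exons * COST_PER_WELL

print(f"Exons surveyed per patient:  {exons:,.0f}")
print(f"Sanger cost per patient:     ${sanger_per_patient:,.0f} (lowball)")
print(f"Illumina genome per patient: ${ILLUMINA_GENOME:,}")
print(f"Sanger exome / Illumina genome: {sanger_per_patient / ILLUMINA_GENOME:.1f}x")
```

Even with the lowball per-well figure, the exon-by-exon Sanger approach comes out roughly five times the quoted whole-genome price, which is the gap driving the shift described above.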
One of the reasons the Cancer Research authors aren't terribly pleased with the progress is clear: they feel the experiments aren't the correct ones. But whereas some of the flak I had seen directed at the cancer genome sequencing concept was instead promoting more functional approaches (such as RNAi library screening), what these authors want (or at least set as the minimum bar for being interesting) is cancer genome sequencing on an almost monomaniacal scale: thousands if not millions of individual cells from the same tumor! Clearly this would be fascinating, as there is plenty of evidence that tumors are a motley collection of genetically variant cells (clonal in the sense that all the tumor cells share a common ancestor, but all of them sloppy DNA copyists). And, as the authors note, no DNA sequencing technology available now or on the immediate horizon has any shot at a project of this scale.
While I do believe this would be interesting, I'm not as certain it would be informative for patient care. Since many of these mutations are under very little selection, the spectrum of observed mutations is likely to be enormous. And given that there is already a horrendous backlog of functionally characterizing the mutations seen in the studies to date (though one paper has already functionally characterized the isocitrate dehydrogenase mutations), adding a vastly larger catalog of mutations would only deepen that backlog.
What is particularly strange about this view is that a more reasonable intermediate step would be to look at those cells that do escape the primary tumor (most of the cancer genome papers so far have focused on primary tumors, though the IDH mutations are primarily found in secondary glioblastomas) -- that is, sequence the metastases. Ideally, this would mean finding multiple patients willing to consent to having their genome, their primary tumor's genome, and multiple metastases' genomes sequenced -- the latter quite likely coming from autopsies (otherwise it is a lot of painful biopsying without much hope of helping the patient, an ethically questionable activity). Or, in leukemias one could more easily resequence after each relapse. Such studies would be technically doable and not ridiculously costly (though clearly not chump change either).
There's also the open question as to whether the real fireworks will come from sequencing less-studied cancers, such as the recent success in using transcriptome sequencing to identify the probable causative mutation in a rare type of ovarian cancer (see also the News and Views piece). Perhaps we've mined the rich ore out of some of these veins, and it is the less-worked seams that will yield fine genomic insights. Time will tell.