Thursday, April 29, 2010

A review by me titled "Application of Second Generation Sequencing to Cancer Genomics" is now available in the Advance Access section of Briefings in Bioinformatics. You'll need a subscription to read it.
I got a little obsessive about making the paper comprehensive. While it focuses on using second generation sequencing for mutation, rearrangement and copy number aberration detection (explicitly ruling RNA-Seq and epigenomics out of scope), it attempts to touch on every paper in the field up to March 1st. To my chagrin, I discovered just after submitting the final revision that I had omitted one paper. I was able to slide it into the final proof, though not without introducing a small error. There's one other paper I might have mentioned, which actually used whole genome amplification upstream of second generation sequencing on a human sample, but it's not a very good paper: the sequencing coverage is horrid and it wasn't about cancer. In any case, while it would surprise me a lot, it won't completely shock me if someone can find a paper in that timeframe that I missed. So don't gloat too much if you find one -- but please post here if you do!
Of course, any constructive criticism is welcome. There are bits I would be tempted to rewrite if I went through the exercise again and the part on predicting the functional implications of mutations could easily be blown out into a review of its own. I don't have time to commit to that, but if anyone wants to draft one I'd help shepherd it at Briefings. I'm actually on the Editorial Board there and this review erases my long-term guilt over being on the masthead for a number of years without actually contributing anything.
As I state in the intro, in a field such as this a printed review is doomed to become incomplete very quickly. I'm actually a bit surprised that there has been only one major cancer genomics paper between my cutoff and the preprint appearing -- the breast cancer quartet paper from Wash U. I fully expect many more papers to appear before the physical issue shows up (probably in the fall), and certainly a year from now much more will have happened. But it is useful to mark off the state of a field at a certain time. In some fields it is common to publish annual or semi-annual reviews which cover all the major events since the last installment; perhaps I should start logging papers with that sort of concept in mind.
One last note: now I can read "the competition". Seriously, another review on the subject, by Elaine Mardis and Rick Wilson, came out around the time I had my first crude set of paragraphs (it would be a stretch to grant it the title of draft). At that time I had two small targeted projects in progress, while they had already published two leukemia genome sequences. It was tempting to read their review, but I feared I would be overly influenced by it -- or worse, would grow paranoid about plagiarizing bits -- so I decided not to read it until mine was published.
Wednesday, April 14, 2010
The value of cancer genomics
I recently got around to reading the "Human Genome at 10" issue of Nature. One feature, on facing pages, is a pair of opinion pieces on cancer genomics by Robert Weinberg and Todd Golub, with Weinberg giving a very negative assessment and Golub a positive outlook.
Weinberg is no ordinary critic of cancer genomics; to say he wrote the book on cancer biology is not to engage in hyperbole but rather to acknowledge a truth -- at work we're actually reviewing the field using his textbook. He made -- and continues to make -- key conceptual advances in cancer biology. So his comments should be considered carefully.
One of Weinberg's concerns is that the ongoing pouring of funds into cancer genomics is starving other areas of cancer research and driving talented researchers from the field. Furthermore, he argues that the yields from cancer genomics to date have been paltry.
I can't agree with him on this score. He cites a few examples, but is being very stingy. I'm pretty sure the concept of lineage addiction, in which a cancer is dependent on overexpression of a wild-type transcription factor governing the normal tissue from which the cancer is derived, arose from several genomics studies. Another great example is the molecular subdivision of diffuse large B-cell lymphomas; each of the subsets (at least 3 peeled off so far) appears to have very different molecular characteristics.
On a broader scale, the key contribution of cancer genomics is, and will continue to be, to provide a concrete test of the theories generated by Weinberg and others using their experimental systems. For example, Weinberg worked extensively on the EGFR-Ras-MAP kinase pathway. Look across many cancers and you find this pathway activated. In non-small cell lung cancer (NSCLC), about half of all tumors are activated by KRAS mutations; in pancreatic cancer this may be near 90%. Other members of the pathway can be activated by mutation as well, but not nearly as frequently: in NSCLC, EGFR accounts for another 20% or so, but BRAF and MAP kinase mutations are rare. Why? Well, that's a new conceptual puzzle. Furthermore, EGFR-Ras-MAPK pathway mutations don't seem to explain all cancers. Indeed, some potent oncogenes in experimental systems are rarely if ever seen driving patient cancers.
One example Weinberg mentions as part of the small haul is IDH1. This is a great story uncovered twice by cancer genomics and still unfolding. IDH1 is part of the Krebs cycle, a key biochemical pathway unleashed on any biology or biochem freshman. Genomics studies in glioblastoma and AML (a leukemia) have uncovered mutations in IDH1; extensive searches in other tumor types have come up negative (except for a report in thyroid cancer). Why the specificity? An unresolved mystery. The really interesting part of the story is that the IDH1 mutations appear to alter the balance of metabolites generated by the enzyme. Unusual metabolites favoring cancer development -- this is a fascinating story, uncovered by genomics.
Another great cancer genomics story was the identification last summer of the causative mutation for granulosa cell tumor (GCT), a rare type of ovarian cancer. This was found by an mRNA sequencing approach.
As I mentioned before, DLBCL had previously been subdivided by expression profiling into distinct groups, which have different outcomes with standard chemotherapy and different underlying molecular mechanisms. The root of one of those mechanisms was recently identified by sequencing, which revealed mutations in a chromatin structure regulatory protein -- a class of oncogenic mutation only recently found by non-genomic means.
Another recent example: copy number microarrays (which provide much less information, but far more cheaply) identified microdeletions targeting cell polarity genes.
Indeed, I would generally argue that cancer genomics is rapidly recapitulating most of what we learned in the previous three decades of study about which genes can drive tumors through gain or loss of function. This doesn't replace the many other things classical approaches have discovered, but it does underscore the power of genomics in this setting. And, of course, genomics is not simply recapitulating but going beyond, identifying new oncogenic players and enumerating the roles of all the current suspects.
My own belief is that Weinberg (and others with similar views) are trying to strangle the genomics effort before it can really spread its wings -- I don't mean anything sinister by that, just that they are attempting to terminate it prematurely. Some cancer genome efforts indeed have little to show -- but very few have been done on a really large scale. With costs plummeting for data acquisition (though perhaps not for data analysis), it will be possible to sequence many, many cancer genomes, and I am confident important discoveries will come in regularly.
What sort of discoveries and studies? There are hundreds of recognized cancers, some very rare. Even the rare ones will have important stories to tell us about human cell biology; they should definitely be extensively sequenced. We also shouldn't be strict speciesists; a number of cancers are hereditary in certain dog breeds and will also have valuable stories to tell. In common tumors, it is pretty clear that many of these definitions are really syndromes: there is not one lung cancer or even one NSCLC, but many, each defined by a different set of key genomic alterations. Enumerating all of those will put the various cancer theories to an acid test; the samples we cannot explain will be new challenges. Current projects targeting major cancers aim to discover all mutations present at 10% or greater frequency. I would argue that is only a good start; even 5% of a major cancer such as lung cancer still represents tens of thousands of cases worldwide.
Cancer is also not a static disease; as in the recent WashU paper, it will be critical to compare primary tumors with metastases to identify the changes which drive that process. Metastatic lesions tend to be what kills patients, so this is of high importance. Lesions also change with therapy, and there is a pressing need to understand those changes so we can devise therapeutics to address them.
All in all, I can easily envision the value of sequencing tens of thousands of samples or even more. Of course, this is what those skeptical of cancer genomics dread; even with the dropping cost of sequencing this will still require a lot of money and resources. Furthermore, really proving which mutations are cancer drivers and which are bystanders -- and what exactly those driver mutations are doing (particularly in genes which we can intuit little about from their sequence) -- will be an enormous endeavour. Cancer genomics will be framing many key problems for the next decade or two of cancer biology.
Of course, mutational and epigenomic information will not tell the entire story of cancer; there are many genes playing important roles in cancer-relevant pathways that never seem to be hit by mutations. Why not is an excellent unanswered question, as is why certain tissue types are more sensitive to the inhibition of specific universal proteins. For example, germline mutations in BRCA1 lead to higher risk of breast, ovarian and pancreatic cancer (with much stronger breast and ovarian risk increases), yet BRCA1 is part of a central DNA repair complex, not some female-specific system. Really fleshing out cancer pathways will take large-scale interaction and functional screens -- a prospect Weinberg specifically notes with dread, citing the idea of a "cancer cell wiring" project. Ironically, just such a project is published in the same issue: the results of a genome-wide RNAi-plus-imaging screen for genes relevant to cell division.
Which gets back to the root problem: if we view cancer funding as a more-or-less zero-sum game, how much should we spend on cancer genomics and how much on investigator-focused functional efforts? That's not an easy question, and I have no easy answer. It doesn't help that I don't even know the sums involved, since I am not subject to the whims of grants (I have different capricious forces shaping my career!). But clearly I would favor a sizable fraction (easily a double-digit percentage) of cancer funding going to genomics projects.
One of the professors in my graduate department, who was actually no fan of genomics, said that a well-designed genetics experiment enables the cell to tell you what is important. Reading cancer genomes is precisely that, enabling us to discover what is truly important to cancer biology.
Thursday, April 08, 2010
Version Control Failure
Well, it was inevitable. A huge (and expensive) case of confused human genome versions.
While the human genome is 10 years old, that first sequence was hardly the final word. Lots of mopping up remained on the original project, and periodically new versions have been released. The 2006 version, known as hg18, became very popular and is the default used by many sites. A later version (hg19) came out in March 2009 and is favored by other services, such as NCBI. UCSC supports all of them, but appears to have a richer set of annotation tracks for hg18. It isn't over yet: not only are there still gaps in the assembly (not to mention the centromeric badlands that are hardly represented), but with further investigation of structural variation across human populations it is likely that the reference will continue to evolve.
This is fraught with danger! Pairing an annotation track built on one genome version with coordinates from a different version yields very confusing results. Curiously, while the header line for the popular BED/BEDGRAPH formats has a number of optional fields, a genome version tag is not among them. Software problems are one thing; doing experiments based on mismatched versions is another.
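For concreteness, here is a track line as the format defines it today, next to the sort of version-aware variant I wish existed. The db= attribute is purely my invention, not part of any spec, and the coordinates are made up:

    track name="my_capture" description="capture probes (coordinates made up)"
    chr7    55200000    55200120    probe_001

    track name="my_capture" description="same track, hypothetically tagged" db=hg18
    chr7    55200000    55200120    probe_001

A parser that understood the second form could refuse to load the track onto the wrong assembly -- exactly the check I plead for at the end of this post.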
What came out in GenomeWeb's In Sequence (subscription required) is that the ABRF (a professional league of core facilities) had decided to study sequence capture methods, choosing to test Nimblegen's & febit's array capture methods along with Agilent's in-solution capture; various other technologies either weren't available or weren't quite up to their specs. I do wish ABRF had tested Agilent in both in-solution and on-array formats, as that would have been an interesting comparison.
What went south is that the design specification uploaded to Agilent used hg19 coordinates, but Agilent's design system (until a few days ago) used hg18. So the wrong array was built and, from it, the wrong in-solution probes were made. When ABRF aligned the data, the targets were off. How far off depends on where you are on the chromosome: the farther down the chromosome, the more likely a region is to be off by a lot. ABRF apparently still got a good amount of overlap, but there was an induced error.
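To get a feel for how the drift accumulates, one can lift a handful of hg18 positions over to hg19. A minimal sketch, using the pyliftover package (my choice of tool for illustration, not anything ABRF used; the positions are arbitrary, and fetching the UCSC chain file requires network access):

    # Lift a few hg18 positions on chr1 to hg19 and report the shift.
    # Requires: pip install pyliftover (downloads the hg18ToHg19 chain from UCSC).
    from pyliftover import LiftOver

    lo = LiftOver('hg18', 'hg19')
    for pos in (1000000, 50000000, 150000000, 240000000):
        hits = lo.convert_coordinate('chr1', pos)
        if hits:
            chrom, new_pos, strand, score = hits[0]
            print('chr1:%d (hg18) -> %s:%d (hg19), shift %+d'
                  % (pos, chrom, new_pos, new_pos - pos))
        else:
            # position falls in a gap or rearranged region with no hg19 mapping
            print('chr1:%d (hg18) has no hg19 mapping' % pos)

Positions near the start of the chromosome tend to shift little, while those far down can be off by a great deal, which is precisely the pattern described above.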
I haven't yet found the actual study; I'm guessing it hasn't been released. If the GenomeWeb article is accurate, then in my opinion it is not kosher to grade the Agilent data against the original experimental plan, since that plan wasn't followed. Either the Agilent data should be evaluated in its entirety against the design as actually built, OR the comparison of the platforms should be restricted to the regions where the actual and intended Agilent designs overlap.
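The second option is conceptually simple: lift the as-built design onto the intended assembly and intersect the two interval sets. In practice a tool like BEDTools would do the job, but the core logic is just a sweep over sorted intervals -- a minimal sketch, assuming half-open (chrom, start, end) tuples sorted by chromosome and start:

    def intersect(a, b):
        """Intersect two sorted lists of (chrom, start, end) half-open intervals."""
        out = []
        i = j = 0
        while i < len(a) and j < len(b):
            ca, sa, ea = a[i]
            cb, sb, eb = b[j]
            if ca == cb and sa < eb and sb < ea:
                # the two intervals overlap; keep only the shared piece
                out.append((ca, max(sa, sb), min(ea, eb)))
            # advance whichever interval ends first (chromosome-aware)
            if (ca, ea) <= (cb, eb):
                i += 1
            else:
                j += 1
        return out

    # e.g. intersect([('chr1', 100, 500)], [('chr1', 300, 700)])
    # -> [('chr1', 300, 500)]

Only targets falling within the resulting intervals would then count toward the platform comparison.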
In any case, I would like to put in a plug here for ABRF to deposit the data in the Short Read Archive. Too few second generation sequencing datasets are ending up there, and targeted sequencing datasets would be particularly valuable from my viewpoint. Granted, a major issue is confidentiality and donors' consent for their DNA, which must be strictly observed. Personally, I believe that if you don't deposit the dataset your paper should state this -- in other words, either way you must explicitly consider depositing and, if you don't, explain why it wasn't possible. The ideal of all data going into public archives was never quite realized even in the Sanger sequencing days, but we've slid far from it in the second generation world.
I had an extreme panic attack one day due to similar circumstances: taking a quick first look at a gene of extreme interest in our first capture dataset, it looked like the heavily captured region was shifted relative to my target gene -- and that my design had the same shift. Luckily, in my case it turned out to be all software -- I had mixed up versions in the analysis but not in the actual design, so that part of the capture experiment was fine (I wish I could say the same about the results, but I can't really talk about them). I now make sure the genome version is part of the filename of all my BED/BEDGRAPH files to reduce the confusion, and I manually BLAT some key sequences to try to confirm the coordinates are on the expected version. While those are useful practices, I strongly believe that there should be a header tag which (if present) is checked on upload.
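To make the habit concrete, here is a minimal sketch of the kind of guard I'd want at the front of every pipeline. The filename convention (build embedded before the extension) is my own practice as described above; the db= tag is the hypothetical header attribute from earlier in this post, and the function names are mine:

    import os
    import re

    # matches e.g. mygene.hg18.bed or coverage.hg19.bedgraph
    BUILD_IN_NAME = re.compile(r'\.(hg\d+)\.(?:bed|bedgraph)$', re.IGNORECASE)

    def file_build(path):
        """Return the genome build for a file, from its name or (hypothetical) db= tag."""
        m = BUILD_IN_NAME.search(os.path.basename(path))
        if m:
            return m.group(1).lower()
        with open(path) as fh:
            first = fh.readline()
        if first.startswith('track'):
            m = re.search(r'\bdb=(\S+)', first)
            if m:
                return m.group(1).lower()
        raise ValueError('no genome build annotation found for %s' % path)

    def assert_same_build(paths, expected=None):
        """Refuse to proceed if files disagree on build (or differ from expected)."""
        builds = dict((p, file_build(p)) for p in paths)
        distinct = set(builds.values())
        if expected is not None:
            distinct.add(expected.lower())
        if len(distinct) > 1:
            raise ValueError('mixed genome builds: %s' % builds)

Calling assert_same_build(['design.hg18.bed', 'coverage.hg18.bedgraph'], expected='hg18') before any analysis step would have caught my panic-attack scenario immediately.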