Thursday, December 17, 2009

A Doublet of Solid Tumor Genomes

Nature this week published two papers, each describing the complete sequencing of a cancer cell line (small cell lung cancer (SCLC) NCI-H209 and melanoma COLO-829) along with a "normal" cell line from the same individual. I'll confess a certain degree of disappointment at first, as these papers are not rich in the information of greatest interest to me, but they have grown on me. Plus, it's rather churlish to complain when I have nothing comparable to offer myself.

The two papers are structured very similarly, perhaps because their author lists overlap heavily, including the same first author. Technically, however, they are quite different. The melanoma sequencing used the Illumina GAII, generating 2x75 paired-end reads supplemented with 2x50 paired-end reads from 3-4 kb inserts, whereas the SCLC paper used 2x25 mate-pair SOLiD libraries with inserts between 400 and 3000 bp.

The papers give estimates of the false positive and false negative rates for the detection of various mutations, in comparison to Sanger data. For single base pair substitutions on the Illumina platform in the melanoma sample, 88% of previously known variants were found and 97% of a sample of 470 newly found variants were confirmed by Sanger. However, for small insertions/deletions (indels) there was both less data and much less success. Only one small deletion was previously known, a 2 base deletion which is key to the biology. This was not found by the automated alignment and analysis, though reads containing this indel could be found in the data. A sample of 182 small indels was checked by Sanger and only 36% were confirmed. Of the large rearrangements tested, 75% were confirmed by PCR.

The statistics for the SOLiD data in SCLC were comparable. 76% of previously known single nucleotide variants were found and 97% of newly found variants were confirmed by Sanger. Two small indels were previously known and neither was found; conversely, only 25% of predicted indels were confirmed by Sanger. 100% of the large rearrangements tested by PCR validated. So overall, both platforms do well at detecting rearrangements and substitutions and are very weak for small indels.
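
To make these rates concrete, here is a minimal Python sketch of how a confirmation rate and a detection rate could be combined into a rough estimate of the true number of variants behind a call set. The function name and the 1,000-call example are hypothetical, and the calculation assumes the Sanger-checked sample is representative of all calls, which neither paper is claimed to guarantee.

```python
def estimate_true_variants(n_calls, confirmation_rate, detection_rate):
    """Rough estimate of how many real variants exist, given a call set.

    n_calls           -- number of variants called (hypothetical input)
    confirmation_rate -- fraction of checked calls confirmed by Sanger
    detection_rate    -- fraction of previously known variants recovered

    Assumes both rates apply uniformly across the whole call set.
    """
    if not 0 < detection_rate <= 1:
        raise ValueError("detection_rate must be in (0, 1]")
    if not 0 <= confirmation_rate <= 1:
        raise ValueError("confirmation_rate must be in [0, 1]")
    real_calls = n_calls * confirmation_rate   # calls expected to be genuine
    return real_calls / detection_rate         # scale up for variants missed


# Substitution rates quoted above, applied to a hypothetical 1,000 calls.
illumina_estimate = estimate_true_variants(1000, 0.97, 0.88)
solid_estimate = estimate_true_variants(1000, 0.97, 0.76)
print(round(illumina_estimate), round(solid_estimate))
```

The same arithmetic cannot sensibly be run for small indels: none of the previously known indels were recovered on either platform, so the detection rate in the denominator is effectively unmeasured.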

The overall mutation hauls were large after filtering out variants found in the normal cell line: 22,910 substitutions in the SCLC line and 33,345 in the melanoma line. Both of these samples reflect serious environmental abuse; melanomas often arise from sun exposure, and the particular cancer morphology the SCLC line is derived from is characteristic of smokers (the smoking history of the patient was unknown). Both lines showed mutation spectra in agreement with what was previously known about these environmental insults. In the melanoma line, 92% of C>T single substitutions occurred at the second base of a pyrimidine dimer (CC or TC sequences). CC>TT double substitutions were also skewed in this manner. CpG dinucleotides are also known to be hotspots and showed elevated mutation frequencies. Transcription-coupled repair repairs the transcribed strand more efficiently than the non-transcribed strand, and in concordance with this, in transcribed regions there was nearly a 2:1 bias of C>T changes on the non-transcribed strand. However, the authors state (but I still haven't quite figured out the logic) that transcription-coupled repair can account for only 1/3 of the bias and suggest that another mechanism, previously suspected but not characterized, is at work. One final consequence of transcription-coupled repair is that the more highly expressed a gene is in COLO-829, the lower its mutational burden. A bias of mutations towards the 3' end of transcribed regions was also observed, perhaps because 5' ends are transcribed at higher levels (due to abortive transcription). A transcribed-strand bias was also seen in G>T mutations, which may reflect oxidative damage.
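
The two tallies behind this paragraph, pyrimidine dimer context and strand relative to transcription, can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the authors' pipeline; the function names and the convention of passing plus-strand bases and a "+"/"-" gene strand are my own assumptions.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PYRIMIDINES = ("C", "T")


def is_ct_at_pyrimidine_dimer(ref, alt, upstream, downstream):
    """True if the change is C>T at the 3' base of a pyrimidine dinucleotide.

    All four arguments are single bases read on the plus strand:
    upstream is the base 5' of the mutated site, downstream the base 3'.
    A plus-strand G>A is a C>T on the minus strand, so both are handled.
    """
    if (ref, alt) == ("C", "T"):
        five_prime = upstream
    elif (ref, alt) == ("G", "A"):
        # On the minus strand, the base 5' of the C is the complement
        # of the plus-strand downstream base.
        five_prime = COMPLEMENT[downstream]
    else:
        return False
    return five_prime in PYRIMIDINES


def pyrimidine_strand(ref, gene_strand):
    """Which strand of the gene carries the pyrimidine of the mutated pair.

    gene_strand is "+" or "-": the genomic strand matching the mRNA
    (the non-transcribed strand); the opposite strand is transcribed.
    """
    pyrimidine_on_plus = ref in PYRIMIDINES
    coding_is_plus = gene_strand == "+"
    if pyrimidine_on_plus == coding_is_plus:
        return "non-transcribed"
    return "transcribed"


print(is_ct_at_pyrimidine_dimer("C", "T", "T", "A"))  # TC context
print(pyrimidine_strand("G", "+"))  # the C sits on the template strand
```

Counting the "non-transcribed" versus "transcribed" results over all C>T changes inside genes is the kind of tally that would expose a strand bias like the roughly 2:1 one described above.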

An additional angle on mutations in the COLO-829 melanoma line is offered by the observation of copy-neutral loss of heterozygosity (LOH) in some regions. In other words, one copy of a chromosome was lost but then replaced by a duplicate of the remaining copy. This analysis is enabled by having the sequence of the normal DNA to identify germline heterozygosity. Interestingly, in these regions heterozygous mutations outnumber homozygous ones, indicating that these substitutions occurred after the reduplication event. 82% of C>T mutations in these regions show the hallmarks of being early mutations, suggesting that the later mutations arose by other means, perhaps after the melanoma metastasized and was therefore removed from ultraviolet exposure.
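
The timing logic here is simple enough to sketch: in a copy-neutral LOH region, a mutation that predates the duplication is carried on both copies, while one that arises afterwards sits on only one. A minimal Python illustration follows, in which the function name, the read counts and the 0.75 allele-fraction cutoff are all hypothetical choices of mine rather than anything taken from the paper.

```python
def classify_timing(alt_reads, total_reads, threshold=0.75):
    """Label a somatic mutation in a copy-neutral LOH region.

    "early" -- present on both copies, so it predates the duplication
    "late"  -- present on one copy, so it arose after the duplication

    The 0.75 cutoff is an assumed midpoint between the expected allele
    fractions of ~1.0 (homozygous) and ~0.5 (heterozygous); it ignores
    sampling noise, subclones and any further copy number change.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    allele_fraction = alt_reads / total_reads
    return "early" if allele_fraction >= threshold else "late"


# Hypothetical read counts at two sites, each covered by 40 reads.
print(classify_timing(38, 40))  # nearly every read carries the variant
print(classify_timing(19, 40))  # about half the reads carry the variant
```

A cell line is a convenient case for this kind of call, since there is no admixed normal tissue to dilute the allele fractions.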

In a similar manner, there is a wealth of information in the SCLC mutational data. I'll skip over a bunch to hit the evidence for a novel transcription-coupled repair pathway that operates on both strands. The key point is that highly expressed genes had lower mutation rates on both strands than less expressed genes. A>G mutations showed a bias for the transcribed strand whereas G>A mutations occurred equally on each strand.

Now, I'll confess I don't generally get excited about looking at mutation spectra. A lot of this has been published before, though these papers offer a particularly rich and low-bias look. What I'm most interested in are recurrent mutations and rearrangements that may be driving the cancer, particularly if they suggest therapeutic interventions. The melanoma line contained two missense mutations in the gene SPDEF, which has been associated with multiple solid tumors. A truncating stop mutation was found when SPDEF was sequenced in 48 additional tumors. A missense change was found in a metalloprotease (MMP28) which has previously been observed to be mutated in melanoma. Another missense mutation was found in a gene which may play a role in ultraviolet repair (though it has been implicated in other processes), suggesting a tumor suppressor role. The sequencing results confirmed two out of three known driver mutations in COLO-829: the V600E activating mutation in the kinase BRAF and deletion of the tumor suppressor PTEN. As noted above, the known 2 bp deletion in CDKN2A was not found through the automated process.

The SCLC sample has a few candidates for interestingly mutated genes. A fusion gene was found in which one partner (CREBBP) has been seen in leukemia gene fusions. An intragenic tandem duplication was found within the chromatin remodelling gene CHD7, which should generate an in-frame duplication of exons. Another SCLC cell line (NCI-H2171) was previously known to have a fusion gene involving CHD7. Screening of 63 other SCLC cell lines identified another (LU-135) with internal exon copy number alterations. LU-135 was further explored by mate pair sequencing with a 3-4 kb library, which identified a breakpoint involving CHD7. Expression analysis showed high expression levels of CHD7 in both LU-135 and NCI-H2171 and generally higher expression of CHD7 in SCLC lines than in non-small cell lung cancer lines and other tumor cell lines. An interesting twist is that the fusion partner in NCI-H2171 and LU-135 is a non-coding RNA gene called PVT1 -- which is thought to be a transcriptional target of the oncogene MYC. MYC is amplified in both these cell lines, suggesting multiple biological mechanisms resulting in high expression of CHD7. It would seem reasonable to expect some high profile functional studies of CHD7 in the not too distant future.

For functional point mutations, the natural place to look is at coding regions and splice junctions, as here we have the strongest models for ranking the likelihood that a mutation will have a biological effect. In the SCLC paper an effort was made to push this a bit further and look for mutations that might affect transcription factor binding sites. One candidate was found but not further explored.

In general, this last point underlines what I believe will be different about subsequent papers. Looking mostly at a single cancer sample, one is limited in what can be inferred. The mutational spectrum work is something which a single tumor can illustrate in detail, and such in-depth analyses will probably be significant parts of the first tumor sequencing paper for each tumor type, particularly other types with strong environmental or genetic mutational components. But, in terms of learning what makes cancers tick and how we can interfere with that, the real need is to find recurrent targets of mutation. Various cancer genome centers have been promising a few hundred tumors sequenced over the next year. Already at the recent ASH meeting (which I did not attend), there were over a half dozen presentations or posters on whole genome or exome sequencing of leukemias, lymphomas and myelomas -- the first ripples of the tsunami to come. But, the raw cost of targeted sequencing remains at most a 10th of the cost of an entire genome. The complete set of mutations found in either one of these papers could have been packed onto a single oligo-based capture scheme, and certainly a high-priority subset could be amplified by PCR without breaking the bank on oligos. I would expect that in the near future tumor sequencing papers will check their mutations and rearrangements on validation panels of at least 50 and preferably hundreds of samples (though assembling such sample collections is definitely not trivial). This will allow the estimation of the population frequency of those mutations which may recur at the level of 5-10% or more. With luck, some of those will suggest pharmacologic interventions which can be tested for their ability to improve patients' lives.
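
Why at least 50 samples? A back-of-the-envelope binomial calculation shows how panel size relates to the chance of seeing a mutation again. The sketch below is my own illustration, with a hypothetical function name, and it assumes each tumor in the panel independently carries the mutation at the stated population frequency.

```python
from math import comb


def prob_at_least(hits, panel_size, frequency):
    """Probability of seeing a mutation in at least `hits` panel samples.

    Treats each sample as an independent draw carrying the mutation with
    probability `frequency` (a simple binomial model; real panels may be
    biased by subtype, stage or how the samples were collected).
    """
    if not 0 <= frequency <= 1:
        raise ValueError("frequency must be in [0, 1]")
    below = sum(
        comb(panel_size, k) * frequency**k * (1 - frequency) ** (panel_size - k)
        for k in range(hits)
    )
    return 1 - below


for panel_size in (50, 200):
    for frequency in (0.05, 0.10):
        once = prob_at_least(1, panel_size, frequency)
        twice = prob_at_least(2, panel_size, frequency)
        print(panel_size, frequency, round(once, 3), round(twice, 3))
```

Under this model a 50-sample panel has a bit better than a 90% chance of catching a 5% mutation at least once, but a noticeably lower chance of catching it twice, which is one reason panels of hundreds of samples are preferable.
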
Pleasance, E., Stephens, P., O’Meara, S., McBride, D., Meynert, A., Jones, D., Lin, M., Beare, D., Lau, K., Greenman, C., Varela, I., Nik-Zainal, S., Davies, H., Ordoñez, G., Mudie, L., Latimer, C., Edkins, S., Stebbings, L., Chen, L., Jia, M., Leroy, C., Marshall, J., Menzies, A., Butler, A., Teague, J., Mangion, J., Sun, Y., McLaughlin, S., Peckham, H., Tsung, E., Costa, G., Lee, C., Minna, J., Gazdar, A., Birney, E., Rhodes, M., McKernan, K., Stratton, M., Futreal, P., & Campbell, P. (2009). A small-cell lung cancer genome with complex signatures of tobacco exposure Nature DOI: 10.1038/nature08629

Pleasance, E., Cheetham, R., Stephens, P., McBride, D., Humphray, S., Greenman, C., Varela, I., Lin, M., Ordóñez, G., Bignell, G., Ye, K., Alipaz, J., Bauer, M., Beare, D., Butler, A., Carter, R., Chen, L., Cox, A., Edkins, S., Kokko-Gonzales, P., Gormley, N., Grocock, R., Haudenschild, C., Hims, M., James, T., Jia, M., Kingsbury, Z., Leroy, C., Marshall, J., Menzies, A., Mudie, L., Ning, Z., Royce, T., Schulz-Trieglaff, O., Spiridou, A., Stebbings, L., Szajkowski, L., Teague, J., Williamson, D., Chin, L., Ross, M., Campbell, P., Bentley, D., Futreal, P., & Stratton, M. (2009). A comprehensive catalogue of somatic mutations from a human cancer genome Nature DOI: 10.1038/nature08658


Jeff said...

One caveat to the fine technical work is that both studies are on cell lines and not the tumor or mets themselves. We know there are differences. What mutations are the results of growth in culture? Must profile tissue samples. There are non-pathological techniques to deconvolute the heterogeneity and clonal nature of the tumor.

Keith Robison said...

Thanks - I meant to emphasize that point. The breast cancer sequencing published in October (Shah et al, which neither of the new papers cite, which is regrettable), did work with a clinical specimen.

buka said...

Have you checked the supplement? :)