Omics! Omics!: A huge scan through cancer genomes

Genentech and Affymetrix just published a huge paper in Nature using a novel technology to scan 4Mb in 441 tumor genomes for mutations, the largest number of tumor samples screened for many genes. Dan Koboldt over at MassGenomics has given a nice overview of the paper, but there are some bits I'd like to fill in as well. I'll blame some of my sloth in getting this out to the fact I was reading back through a chain of papers to really understand the core technique, but that's a weak excuse.

It's probably clear by now that I am a strong proponent (verging on cheerleader) for advanced sequencing technologies and their aggressive application, especially in cancer. The technology used here is intriguing, but it is in some ways a bit of a throwback. Now, on thinking that (and then saying it aloud) forces me to think about why I say that and perhaps this is a wave of the future, but I am skeptical -- but that doesn't detract from what they did here.

The technology, termed "mismatch repair detection", relies on some clever co-opting of the normal DNA repair mechanisms in E.coli. So clever is the co-opting, that the repair mechanisms are used to sometimes break a perfectly good gene!

The assay starts by designing PCR primers to generate roughly 200 bp amplicons. A reference library is generated from a normal genome and cloned into a special plasmid. This plasmid contains a functional copy of the Cre recombinase gene as well as the usual complement of gear in a cloning plasmid. This plasmid is grown in a host which does not Dam methylate its DNA, a modification in E.coli which marks old DNA to distinguish it from newly synthesized DNA.

The same primers are used to amplify target regions from the cancer genomes. These are cloned into a nearly identical vector, but with two significant differences. First, it has been propagated in a Dam+ E.coli strain; the plasmid will be fully methylated. Second, it also contains a Cre gene, but with a 5 nucleotide deletion which renders it inactive.

If you hybridize the test plasmids to the reference plasmids and then transform E.coli, one of two results occur. If there are no point mismatches, then pretty much nothing happens and Cre is expressed from the reference strand. The E.coli host contains an engineered cassette for resistance to one antibiotic (Tet) but sensitivity to another antibiotic (Str). With active Cre, this cassette is destroyed and the antibiotic resistance phenotype switched to Tet sensitivity and Str resistance.

However, the magic occurs if there is a single base mismatch. In this case, the methylated (test) strand is assumed to be the trustworthy one, and so the repair process eliminates the reference strand -- along with the functional allele of Cre. Without Cre activity, the cells remain resistant to Tet and sensitive to Str.

So, by splitting the transformation pool (all the amplicons from one sample transformed en masse) and selecting one half with Str and the other with Tet, plasmids are selected that either carry or lack a variant allele. Compare these two populations to a two-color resequencing array and you can identify the precise changes in the samples.

A significant limitation of the system is that it is really sensitive only for single base mismatches; any sort of indels or rearrangements are not detectable. The authors wave indels away ash "typically are a small proportion of somatic mutation", but of course they are a very critical type of mutation in cancer as they frequently are a means to knock out tumor suppressors. For large scale deletions or amplifications they use a medium density (244K) array, amusingly from Agilent. Mutation scanning was performed in both tumor tissue and matched normal, enabling the bioinformatic filtering of germline variants (though dbSNP was apparently used as an additional filter).

No cost estimates are given for the approach. Given the use of arrays, the floor can't be much below $500/sample or $1000/patient. The MRD system can probably be automated reasonably well but with a large investment in robots. Now, a comparable second generation approach (scanning about 4Mb) using any of the selection technologies would probably run $1000-$2000 per sample (2X that per patient), or perhaps 2-4X as much. So, if you were planning such an experiment you'd need to trade off your budget versus being blind to any sort of indels. The copy number arrays add expense but enable seeing big deletions and amplifications, though with sequencing the incremental cost of that information in a large study might be a few hundred dollars.

I think the main challenge to this approach is it is off the beaten path. Sequencing based methods are receiving so much investment that they will continue to push the price gap (whatever it is) closer. Perhaps the array step will be replaced with a sequencing assay, but the system both relies on and is hindered by the repair system's blindness to small indels. Sensitivity for the assay is benchmarked at 1%, which is quite good. Alas, no discussion was made of amplicon failure rates or regions of the genome which could not be accessed. Between high/low GC content and E.coli-unfriendly human sequences, there must have been some of this.

There is another expense which is not trivial. In order to scan the 4Mb of DNA, nearly 31K PCR amplicons were amplified out of each sample. This is a pretty herculean effort in itself. Alas, the Materials & Methods section is annoyingly (though not atypically) silent on the PCR approach. With correct automation, setting up that many PCRs is tedious but not undoable (though did they really make nearly 1K 384 well plates per sample??). But, conventional PCR quite often requires about 10ng of DNA per amplification, with a naive implication of nearly half a milligram of input DNA -- impossible without whole genome amplification, which is at best a necessary evil as it can introduce biases and errors. Second generation sequencing libraries can be built from perhaps 100ng-1ug of DNA, a significant advantage on this cost axis (though sometimes still a huge amount from a clinical tumor sample).

Now, perhaps one of the microfluidic PCR systems could be used, but if the hybridization of tester and reference DNAs requires low complexity pools, a technique such as RainDance isn't in the cards. My friend who sells the 48 sample by 48 amplicon PCR arrays would be in heaven if they adopted that technology to run these studies.

One plus of the study is a rigorous sample selection process. In addition to requiring 50% tumor content, every sample was reclassified by a board-certified pathologist and immunohistochemistry was used to ensure correct differentiation of the three different lung tumor types in the study (non-small cell adenocarcinoma, non-small cell squamous, and small cell carcinoma). Other staining was used to subclassify breast tumors by common criteria (HER2, estrogen receptor and progesterone receptor) and the prostate tumors were typed by an RT-PCR assay for a common (70+% of these samples!) driver fusion protein (TMPRSS2-ERG).

Also, it should be noted that they experimentally demonstrated a generic oncogenic phenotype (anchorage independent growth) upon transformation with mutants discovered in the study. That they could scan for so much and test so few is not an indictment of the paper, but a sobering reminder of how fast mutation finding is advancing and how slowly our ability to experimentally test those findings.

Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VE, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, Faham M, & Seshagiri S (2010). Diverse somatic mutation patterns and pathway alterations in human cancers. Nature PMID: 20668451

Omics! Omics!

Friday, July 30, 2010

A huge scan through cancer genomes

1 comment: