Friday, July 30, 2010

A huge scan through cancer genomes

Genentech and Affymetrix just published a huge paper in Nature using a novel technology to scan 4Mb in 441 tumor genomes for mutations, the largest number of tumor samples yet screened for so many genes. Dan Koboldt over at MassGenomics has given a nice overview of the paper, but there are some bits I'd like to fill in as well. I'll blame some of my sloth in getting this out on the fact that I was reading back through a chain of papers to really understand the core technique, but that's a weak excuse.

It's probably clear by now that I am a strong proponent (verging on cheerleader) for advanced sequencing technologies and their aggressive application, especially in cancer. The technology used here is intriguing, but it is in some ways a bit of a throwback. Thinking that (and then saying it aloud) forces me to ask why I say so. Perhaps this is a wave of the future, though I am skeptical -- but that doesn't detract from what they did here.

The technology, termed "mismatch repair detection", relies on some clever co-opting of the normal DNA repair mechanisms in E.coli. So clever is the co-opting, that the repair mechanisms are used to sometimes break a perfectly good gene!

The assay starts by designing PCR primers to generate roughly 200 bp amplicons. A reference library is generated from a normal genome and cloned into a special plasmid. This plasmid contains a functional copy of the Cre recombinase gene as well as the usual complement of gear in a cloning plasmid. This plasmid is grown in a host which does not Dam methylate its DNA, a modification in E.coli which marks old DNA to distinguish it from newly synthesized DNA.

The same primers are used to amplify target regions from the cancer genomes. These are cloned into a nearly identical vector, but with two significant differences. First, it has been propagated in a Dam+ E.coli strain; the plasmid will be fully methylated. Second, it also contains a Cre gene, but with a 5 nucleotide deletion which renders it inactive.

If you hybridize the test plasmids to the reference plasmids and then transform E.coli, one of two results occurs. If there are no point mismatches, then pretty much nothing happens and Cre is expressed from the reference strand. The E.coli host contains an engineered cassette for resistance to one antibiotic (Tet) but sensitivity to another antibiotic (Str). With active Cre, this cassette is destroyed and the antibiotic resistance phenotype is switched to Tet sensitivity and Str resistance.

However, the magic occurs if there is a single base mismatch. In this case, the methylated (test) strand is assumed to be the trustworthy one, and so the repair process eliminates the reference strand -- along with the functional allele of Cre. Without Cre activity, the cells remain resistant to Tet and sensitive to Str.

So, by splitting the transformation pool (all the amplicons from one sample transformed en masse) and selecting one half with Str and the other with Tet, plasmids are selected that either carry or lack a variant allele. Compare these two populations on a two-color resequencing array and you can identify the precise changes in the samples.
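The selection logic described above boils down to a small truth table. Here is a toy Python sketch of it, covering only the outcomes as I've described them; it is not a model of the real biology or of the authors' pipeline, and every name in it (`Phenotype`, `colony_phenotype`, `split_pool`, the amplicon labels) is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    tet_resistant: bool
    str_resistant: bool


def colony_phenotype(has_single_base_mismatch: bool) -> Phenotype:
    """Toy model of the host phenotype after transforming one heteroduplex."""
    # A mismatch triggers repair that discards the unmethylated reference
    # strand, and with it the only functional copy of Cre.
    cre_active = not has_single_base_mismatch
    if cre_active:
        # Cre destroys the cassette: Tet resistance lost, Str resistance gained.
        return Phenotype(tet_resistant=False, str_resistant=True)
    # Cassette intact: the cell stays Tet resistant and Str sensitive.
    return Phenotype(tet_resistant=True, str_resistant=False)


def split_pool(amplicons: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Return (survivors on Tet, survivors on Str) for a pool of amplicons.

    `amplicons` maps a hypothetical amplicon name to whether the tumor copy
    differs from the reference by a single base.
    """
    tet_pool = [n for n, mm in amplicons.items() if colony_phenotype(mm).tet_resistant]
    str_pool = [n for n, mm in amplicons.items() if colony_phenotype(mm).str_resistant]
    return tet_pool, str_pool
```

Calling `split_pool({"amp1": True, "amp2": False})` puts `amp1` (the variant) in the Tet pool and `amp2` in the Str pool, which are the two populations that go onto the array.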

A significant limitation of the system is that it is really sensitive only for single base mismatches; any sort of indels or rearrangements are not detectable. The authors wave indels away as "typically are a small proportion of somatic mutation", but of course they are a very critical type of mutation in cancer as they frequently are a means to knock out tumor suppressors. For large scale deletions or amplifications they use a medium density (244K) array, amusingly from Agilent. Mutation scanning was performed in both tumor tissue and matched normal, enabling the bioinformatic filtering of germline variants (though dbSNP was apparently used as an additional filter).

No cost estimates are given for the approach. Given the use of arrays, the floor can't be much below $500/sample or $1000/patient. The MRD system can probably be automated reasonably well but with a large investment in robots. Now, a comparable second generation approach (scanning about 4Mb) using any of the selection technologies would probably run $1000-$2000 per sample (2X that per patient), or perhaps 2-4X as much as the MRD floor. So, if you were planning such an experiment you'd need to trade off your budget versus being blind to any sort of indels. The copy number arrays add expense but enable seeing big deletions and amplifications, though with sequencing the incremental cost of that information in a large study might be a few hundred dollars.

I think the main challenge to this approach is that it is off the beaten path. Sequencing based methods are receiving so much investment that they will continue to narrow the price gap (whatever it is). Perhaps the array step will be replaced with a sequencing assay, but the system both relies on and is hindered by the repair system's blindness to small indels. Sensitivity for the assay is benchmarked at 1%, which is quite good. Alas, no discussion was made of amplicon failure rates or regions of the genome which could not be accessed. Between high/low GC content and E.coli-unfriendly human sequences, there must have been some of this.

There is another expense which is not trivial. In order to scan the 4Mb of DNA, nearly 31K PCR amplicons were amplified out of each sample. This is a pretty herculean effort in itself. Alas, the Materials & Methods section is annoyingly (though not atypically) silent on the PCR approach. With correct automation, setting up that many PCRs is tedious but not undoable (though did they really make roughly 80 384-well plates per sample??). But, conventional PCR quite often requires about 10ng of DNA per amplification, with a naive implication of roughly a third of a milligram of input DNA -- impossible without whole genome amplification, which is at best a necessary evil as it can introduce biases and errors. Second generation sequencing libraries can be built from perhaps 100ng-1ug of DNA, a significant advantage on this cost axis (though sometimes still a huge amount from a clinical tumor sample).
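Here is a quick back-of-the-envelope version of that arithmetic in Python. It assumes one amplicon per well with no multiplexing, and takes the 10ng-per-reaction figure at face value; both are my naive assumptions, not anything stated in the paper.

```python
import math

AMPLICONS_PER_SAMPLE = 31_000  # "nearly 31K" amplicons per sample
WELLS_PER_PLATE = 384          # assumes one amplicon per well, no multiplexing
DNA_PER_PCR_NG = 10            # rough per-reaction input for conventional PCR

plates = math.ceil(AMPLICONS_PER_SAMPLE / WELLS_PER_PLATE)
total_dna_ng = AMPLICONS_PER_SAMPLE * DNA_PER_PCR_NG
total_dna_mg = total_dna_ng / 1_000_000  # 1 mg = 1,000,000 ng

print(f"384-well plates per sample: {plates}")            # 81
print(f"input DNA per sample: {total_dna_mg:.2f} mg")     # 0.31 mg
```

Any multiplexing of the PCRs would shrink both numbers proportionally, which is exactly the detail the methods section leaves out.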

Now, perhaps one of the microfluidic PCR systems could be used, but if the hybridization of tester and reference DNAs requires low complexity pools, a technique such as RainDance isn't in the cards. My friend who sells the 48 sample by 48 amplicon PCR arrays would be in heaven if they adopted that technology to run these studies.

One plus of the study is a rigorous sample selection process. In addition to requiring 50% tumor content, every sample was reclassified by a board-certified pathologist and immunohistochemistry was used to ensure correct differentiation of the three different lung tumor types in the study (non-small cell adenocarcinoma, non-small cell squamous, and small cell carcinoma). Other staining was used to subclassify breast tumors by common criteria (HER2, estrogen receptor and progesterone receptor) and the prostate tumors were typed by an RT-PCR assay for a common (70+% of these samples!) driver fusion protein (TMPRSS2-ERG).

Also, it should be noted that they experimentally demonstrated a generic oncogenic phenotype (anchorage independent growth) upon transformation with mutants discovered in the study. That they could scan for so much and test so few is not an indictment of the paper, but a sobering reminder of how fast mutation finding is advancing and how slowly our ability to experimentally test those findings is catching up.

Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VE, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, Faham M, & Seshagiri S (2010). Diverse somatic mutation patterns and pathway alterations in human cancers. Nature PMID: 20668451

Thursday, July 22, 2010

Salespeople, don't forget your props!

I had lunch today with a friend & former colleague who sells some cool genomics gadgets. One thing I've noted about him is that whenever we meet he has a part of his system with him; it's striking how often this isn't the case with other salespeople (he's also been kind enough to leave me one).

Now, different systems have different sorts of gadgets with different levels of portability and attractiveness. The PacBio instrument is reputed to weigh in at one imperial ton, making it impractical for bringing along. Far too many folks are selling molecular biology reagents, which all come in the same sorts of Eppendorf tubes.

But on the other hand, there are plenty of cool parts that can be shown off. Flowcells for sequencers are amazing devices, which I've seen far too few times in the hands of salespersons. One of the Fluidigm microfluidic disposables is quite a conversation piece -- and the best illustration for how the technology works. The ABI 3730 sequencer's 96 capillary array was so striking I once took a picture of it -- or I thought I had until I looked through the camera files. The capillaries are coated with Kapton (or a similar polymer), giving them a dark amber appearance. They are delicate yet sturdy, bending over individually but in an organized fashion.

However, my favorite memory of a gadget was an Illumina 96-pin bead array device. The beads are small enough that they produce all sorts of interesting optical effects -- 96 individually mounted opals!

Of course, those gadgets are not cheap. However, any well run manufacturing process is going to have failures, which is a good source for display units. Yes, if you have a really good process you won't generate any defective products, but given the rough state of the field I'm a bit suspicious of a process whose downstream QC never finds a problem. In any case, even if you can't afford to give them away at least the major trade show representatives should carry one. If a picture is worth a thousand words, then an actual physical object is worth exponentially more in persuasive ability.

One final thought. Given the rapidly changing nature of the business, many of these devices have very short lifetimes (in some cases because the company making them has a similarly short lifetime). I sincerely hope some museum is collecting examples of these, as they are important artifacts of today's technology. Plus, I really could imagine an art installation centered around some 3730 capillaries & Illumina bead arrays.

Wednesday, July 21, 2010

Distractions -- there's an app for that

Today I finally gave in to temptation & developed a Hello World application for my Droid. Okay, developed is a gross overstatement -- I successfully followed a recipe. But it took a while to install the SDK & its plugin for the Eclipse environment, plus the necessary device driver so I can debug stuff on my phone.

Since I purchased my Droid in November the idea of writing something for it has periodically tempted me. Indeed, one attraction of Scala (which I've done little with for weeks) was that it can be used to write Android apps, though it definitely means a new layer of complexity. This week's caving in had two drivers.

First, Google last week announced a novice "you can write an app even if you can't program" tool called AppInventor. I rushed to try it out, only to find that they hadn't actually made it available; there was only a registration form. Supposedly they'll get back to you, but they haven't yet. Perhaps it's because I'm not an educator -- the form has lots of fields aimed at educators.

The second trigger is that an Android book I had requested came in at the library. Now, it's for a few versions back of the OS -- but certainly okay for a start (trying to keep public library collections current on technical stuff is a quixotic task in my opinion, though I do enjoy the fruits of the effort). So that was my train reading this morning & it got me stoked. The book is certainly not much more than a starting springboard -- I'm debating buying one called "Advanced Android Programming" (or something close to that) or whether just to sponge off on-line resources.

The big question is what to do next. The general challenge is choosing between apps that don't do anything particularly sophisticated but are clearly doable vs. more interesting apps that might be a bit much to take on -- especially given the challenge of operating a simulator for a device very unlike my laptop (accelerometers! GPS!). I have a bunch of ideas for silly games or demos, most of which shouldn't be too hard -- and then one concept that could be somewhat cool but also really pushing the envelope on difficulty.

It would be nice to come up with something practical for my work, but right now I haven't many ideas in that area. Given that most of the datasets I work with now are enormous, it's hard to see any point to trying to access them via phone. A tiny browser for the UCSC genome database has some appeal, but that's sounding a bit ambitious.

If I were still back at Codon Devices, I could definitely see some app opportunities, either to demo "tech cred" or to be really useful. For example, at one point we were developing (through an outsourced vendor) a drag-and-drop gene design interface. The full version probably wouldn't be very app appropriate, but something along those lines could be envisioned -- call up any protein out of Entrez & have it codon optimized with appropriate constraints & sent to the quoting system. In our terminal phase, it would have been very handy to have a phone app to browse metabolic databases such as KEGG or BioCyc.

That thought has suggested what I would develop if I were back in school. There is a certain amount of simple rote memorization that is either demanded or turns out to expedite later studies. For example, I really do feel you need to memorize the single letter IUPAC codes for nucleotides and amino acids. I remember having to memorize amino acid structures and the Krebs cycle and glycolysis and all sorts of organic synthesis reactions and so forth. I often devised either decks of flash cards or study sheets, which I would look at while standing in line for the cafeteria or other bits of solitary time. Some of those decks were a bit sophisticated -- for the pathways I remember making both compound-centric and reaction-centric cards for the same pathways. That sort of flashcard app could be quite valuable -- and perhaps even profitable if you could get students to try it out. I can't quite see myself committing to such a business, even as a side-line, so I'm okay with suggesting it here.
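To make the compound-centric vs. reaction-centric idea concrete, here is a minimal Python sketch that derives both kinds of card from one list of reactions. The first three steps of glycolysis serve only as sample data, and the names (`Reaction`, `reaction_cards`, `compound_cards`) and the question wording are invented for illustration; a real app would need far more than this.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    substrate: str
    enzyme: str
    product: str


# First three steps of glycolysis, used only as sample data.
GLYCOLYSIS_START = [
    Reaction("glucose", "hexokinase", "glucose-6-phosphate"),
    Reaction("glucose-6-phosphate", "phosphoglucose isomerase", "fructose-6-phosphate"),
    Reaction("fructose-6-phosphate", "phosphofructokinase", "fructose-1,6-bisphosphate"),
]


def reaction_cards(pathway):
    """One card per reaction: prompt with the enzyme, answer with the conversion."""
    return [
        (f"What does {r.enzyme} catalyze?", f"{r.substrate} -> {r.product}")
        for r in pathway
    ]


def compound_cards(pathway):
    """One card per compound: prompt with the compound, answer with the next step."""
    return [
        (f"What is {r.substrate} converted to, and by what?", f"{r.product} (by {r.enzyme})")
        for r in pathway
    ]


for front, back in reaction_cards(GLYCOLYSIS_START) + compound_cards(GLYCOLYSIS_START):
    print(f"{front}  ->  {back}")
```

Keeping the pathway as a single list and generating both decks from it means a correction to one reaction fixes every card that mentions it.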

Tuesday, July 13, 2010

There are 2 styles of Excel reports: Mine & Wrong

A key discovery which is made by many programmers, both inside and outside bioinformatics, is that Microsoft Excel is very useful as a general framework for reporting to users. Unfortunately, many developers don't get beyond that discovery to think about how to use this to best advantage. I've developed some pretty strong opinions on this, which have been repeatedly tested recently by various files I've been sent. I've also used this mechanism repeatedly, with some Codon reports for which I am guilty of excessive pride.

An overriding principle for me is that I am probably going to use any report in Excel as a starting point for further analysis, not an endpoint. I'm going to do further work in Excel or import it into Spotfire (my preference) or JMP or R or another fine tool. Unfortunately, there are a lot of practices which frustrate this.

First, as much data as possible should be packed into as few tabs as practical. Unless you have a very good reason, don't put data formatted the same way into multiple files or multiple tabs. I recently got some sequencing results from a vendor and there was one file per amplicon per sample. I want one file per total project!

Second, the column headers need to be ready for import. That means a single row of column headers, with every column having a specific and unique header. Yes, for viewing it sometimes looks better to have multiple rows and use cell fusing and other tricks to minimize repetition -- but for import this is a disaster, either guaranteed or likely.
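A report generator can check this rule before writing anything out. Below is a minimal Python sketch of such a check; the function name `header_problems` is hypothetical, and treating headers that differ only by case as duplicates is my own assumption about what import tools will tolerate.

```python
def header_problems(headers):
    """Return a list of human-readable problems with a single header row."""
    problems = []
    seen = {}  # lower-cased header -> 1-based column where it first appeared
    for index, raw in enumerate(headers, start=1):
        name = (raw or "").strip()
        if not name:
            problems.append(f"column {index}: empty header")
            continue
        key = name.lower()  # assumption: case-only differences count as duplicates
        if key in seen:
            problems.append(f"column {index}: '{name}' duplicates column {seen[key]}")
        else:
            seen[key] = index
    return problems


print(header_problems(["Sample", "Gene", "", "sample"]))
# ['column 3: empty header', "column 4: 'sample' duplicates column 1"]
```

An empty list means the header row is safe to hand to an import tool.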

Third, every row needs to tell as complete a story as possible. Again, don't go fusing cells! It looks good, but nobody downstream can tell that the second row really repeats the first N cells of the row above (because they are fused).

Fourth, don't worry about extra rows. One tool I use for analysis of Sanger data spits out a single row per sample with N columns, one column for each mutation. This is not a good format! Similarly, think very carefully before packing a lot into a single cell -- Excel is terrible for parsing that back out. Don't be afraid to create lots of columns & rows -- Excel is much better at hiding, filtering or consolidating than it is at parsing or expanding.
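As an illustration of the reshaping I end up doing by hand, here is a small Python sketch that turns a one-row-per-sample, one-column-per-mutation table into one row per mutation. The column names (`Sample`, `Mut1`, ...) and the sample data are invented for the example, and real tool output will certainly need more care than this.

```python
import csv
import io


def wide_to_long(wide_csv_text, id_column="Sample"):
    """Turn one-row-per-sample, N-mutation-columns into one row per mutation."""
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    long_rows = []
    for row in reader:
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the id column itself and empty mutation slots
            long_rows.append({id_column: row[id_column], "Mutation": value})
    return long_rows


# Made-up sample data in the "wide" layout.
example = "Sample,Mut1,Mut2,Mut3\nS1,G12D,,\nS2,V600E,R273H,\n"
for long_row in wide_to_long(example):
    print(long_row)
```

The long form has more rows, but every row is complete on its own, which is what makes filtering and pivoting in Excel painless.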

Finally, color or font coding can be useful -- but use it carefully and generally redundantly. Ignoring the careful part means generating confusing "angry fruit salad" displays (and never EVER make text blink in a report or slide!!!).

Follow these simple rules and you can make reports which are springboards for further exploration. It's also a good start to thinking about using Excel as a simple front end to SQL databases.

So what was so great about my Codon reports? Well, I had figured out how to generate the XML to handle a lot of nice features of the sort I've discussed above. The report had multiple tabs, each giving a different view or summary of the data. The top tab did break my rules -- it was a purely summary table & was not formatted for input into other tools (though now I'm feeling guilty about that; perhaps I should have had another tab with it properly formatted). But each additional tab stuck to the rules. All of them had AutoFilter already turned on and had carefully chosen highlighting when useful -- using a combination of cell color and text highlighting to emphasize key cells. Furthermore, it also hewed to my absolute dictum "Sequences must always be in a fixed width font!". I didn't have it automatically generate Pivot Tables; perhaps eventually I would have gotten there.