Thursday, October 29, 2009

My Most Expensive Paper

Genome Research has a paper detailing the Mammalian Gene Collection (MGC), and if you look way down on the long author list (which includes Francis Collins!) you'll see mine there along with two Codon Devices colleagues. This paper cost me a lot -- nothing in legal tender, but a heck of a lot of blood, sweat & tears.

The MGC is an attempt to have every human & mouse protein coding sequence (plus more than a few rat) available as an expression clone, with native sequence. Most of the genes were cloned from cDNA libraries, but coding sequences which couldn't be found that way were farmed out to a number of synthetic biology companies. Codon decided to take on a particularly challenging tranche of mostly really long ORFs, hoping to demonstrate our proficiency in this difficult task.

At the start, the attitude was "can-do". When it appeared we couldn't parse some targets into our construction scheme, I devised a new algorithm that captured a few more (which I blogged about cryptically). It was going to be a huge order which would fill our production pipeline in an expansive new facility we had recently moved into, replacing a charming but cramped historic structure. A new system for tracking constructs through the facility was about to be rolled out that would let us finally track progress across the pipeline without a human manager constantly looking over each plasmid's shoulder. The delivery schedule for MGC was going to be aggressive but would show our chops. We were going to conquer the world!

Alas, almost as soon as we started (and had sunk huge amounts of cash into oligos) we discovered ourselves in a small wicker container which was growing very hot. Suddenly, nothing was working in the production facility. A combination of problems, some related to the move (a key instrument incorrectly recalibrated) and another problem whose source was never quite nailed down, forced a complete halt to all production activity for several months -- which soon meant that MGC was going to be the only trusty source of revenue, if we could get MGC to release us from our now utterly undoable delivery schedule.

Eventually, we fixed the old problems & got new processes in place and pushed a bunch of production forward. We delivered a decent first chunk of constructs to MGC, demonstrating that we were for real (but still with much to deliver). Personnel were swiped from the other piece of the business (protein engineering) to push work forward. More and more staff came in on weekends to keep things constantly moving.

Even so, trouble was still a constant theme. Most of the MGC project consisted of large constructs, which were built by a hierarchical strategy. That meant the first key task was to build all the parts -- and some parts just didn't want to be built. We had two processes for building "leaves", and both underwent major revisions and on-the-fly process testing. We also started screening more and more plasmids by sequencing, sometimes catching a single correct clone in a mountain of botched ones (but running up a higher and higher capillary sequencing bill). Sometimes we'd get almost right pieces, which could be fixed by site-directed mutagenesis -- yet another unplanned cost in reagents & skilled labor. I experimented with partial redesigns of some builds -- but with the constraint of not ordering more costly oligos. Each of these pulled in a few more constructs, a few more delivered -- and a frustrating pile of still unbuilt targets.

Even when we had all the parts built, the assembly of them to the next stage was failing at alarming rates -- usually by being almost right. Yet more redesigns requiring fast dancing by the informatics staff to support. More constructs pushed through. More weekend shifts.

In the end, when Codon shut down its gene synthesis business -- about 10 months after starting the MGC project -- we delivered a large fraction of our assignment -- but not all of it. For a few constructs we delivered partial sequences for partial credit. It felt good to deliver -- and awful to not deliver.

Now, given all that I've described (and more I've left out), I can't help but feel a bit guilty about that author list. It was decided at some higher level that the author list would not be several miles long, and so some sort of cut had to be made. Easily 50 Codon employees played some role in the project, and certainly there were more than a dozen for whom it occupied a majority of their attention. An argument could have been easily made for at least that many Codon authors. But, the decision was made that the three of us who had most shared the project management aspect would go on the paper. In my case, I had ended up the main traffic cop, deciding which pieces needed to be tried again through the main pipeline and which should be directed to the scientist with magic hands. For me, authorship is a small token for the many nights I ran SQL queries at midnight to find out what had succeeded and what had failed in sequencing -- and then checked again at 6 in the morning before heading off to work. Even on weekends, I'd be hitting the database in the morning & night to find out what needed redirecting -- and then using SQL inserts to redirect them. I realized I was on the brink of madness when I was sneaking in queries on a family ski weekend.

Perhaps after such a checkered experience it is natural to question the whole endeavor. The MGC effort means that researchers who want to express a mammalian protein from a native coding sequence can do so. But how much of what we built will actually get used? Was it really necessary to build the native coding sequence -- which often gave us headaches in the builds from repeats & GC-rich regions (or, as we belatedly discovered, certain short runs of G could foul us up)? MGC is a great resource, but the goal of a complete catalog of mammalian genes wasn't realized -- some genes still aren't available from MGC or any of the commercial human gene collections.

MGC also torture-tested Codon's construction processes, and the original ones failed badly. Our in-progress revisions fared much better, but still did not succeed as frequently as they should have. When we could troubleshoot things, we could ascribe certain failures to almost every conceivable source -- bad enzymes, a bad oligo well, failure to follow procedures, laboratory mix-ups, etc. But an awful lot could not be pinned to any cause, despite investigation, suggesting that we simply did not understand our system well enough to use it in a high-throughput production environment.

I do know one thing: while I hope to stay where I am for a very long time, should I ever be looking for a job again I will avoid a production facility. Some gene synthesis projects were worse than MGC in terms of demanding customers with tight timelines (which is no knock on the customers; now I'm that customer!), but even with MGC I found it's just not the right match for me. It's no fun to burn so much effort on just getting something through the system so that somebody else can do the cool biology. I don't ever want to be in a situation where I'm on vacation and thinking about which things are stalled in the line. Some people thrive in the environment; I found it draining.

But, there is something to be said for the experience. I learned a lot which can be transferred to other settings. That which doesn't kill us makes us stronger -- MGC must have made me Superman.

Monday, October 26, 2009


Curiosity question: do the current DTC genomics companies report out copy number variations (CNVs) to their customers? Are any of their technologies unable to read these? Clearly Knome (or Illumina, which isn't DTC but sort of competing with them) should be able to get this info from the shotgun sequencing. But what about the array-based companies such as Navigenics & 23andMe? My impression is that any high density SNP array data can be mined for copy number info, but perhaps there are caveats or restrictions on that.

It would seem that with CNVs so hot in the literature and a number of complex diseases being associated with them, this would be something the DTC companies would jump at. But have they?

Saturday, October 24, 2009

Now where did I misplace that genome segment of mine?

One of the many interesting ASHG tidbits from the Twitter feed is a comment from "suganthibala" which I'll quote in full:
On average we each are missing 123 kb. homozygously. An incomplete genome is the norm. What a goofy species we are.

I'm horribly remiss in tracking the CNV literature, but this comment makes me wonder whether this is atypical at all. How extensively has this been profiled in other vertebrate species and how do other species look in terms of the typical amount of genome missing? I found two papers for dogs, one of which features a former lab mate as senior author and the other one has Evan Eichler in the author list. Some work has clearly been done in mouse as well.

Presumably there is some data for Drosophila, but how extensive? Are folks going through their collections of D. melanogaster collected from all over the world and looking for structural variation? With a second gen sequencer, this would be straightforward to do -- though a lot of libraries would need to be prepped! Many flies could be packed into one lane of Illumina data, so this would take some barcoding. Even cheaper might be to do it on a Polonator (reputed to cost about $500 in consumables per run, not including library prep).

Attacking this by paired-end/mate-pair NGS rather than arrays (which have been the workhorse so far) would enable detecting balanced rearrangements, which arrays are blind to (though there is another tweeted item that Eichler states "Folks you can't get this kind of information from nextgen sequencing; you need old-fashioned capillaries" -- I'd love to hear the background on that). That leads to another proto-thought: will the study of structural variation lead to better resolution of the conundrum of speciation and changes in chromosome structure -- i.e. it's easy to see how such rearrangements could lead to reproductive isolation but not easy to see how they would be sufficiently non-isolating to allow for enough founders.

ASHG Tweets: Minor Fix or Slow Torture?

Okay, I'll admit it: I've been ignoring Twitter. It doesn't help that I never really learned to text (I might have sent one in my life). Maybe if I ever get a phone with a real keyboard, but even then I'm not sure. Live blogging from meetings seemed a bit interesting -- but in those tiny packets? I even came up with a great post on Twitter -- alas a few days after the first of April, when it would have been appropriate.

But now I've gotten myself hooked on the Twitter feed coming from attendees at the American Society for Human Genetics. It's an interesting mix -- some well established bloggers, lots of folks I don't know plus various vendors hawking their booths or off-conference tours and such. Plus, you don't even need a Twitter account!

The only real problem is it's really making me wish I was there. I've never been to Hawaii, despite a nearly lifelong interest in going. And such a cool meeting! But, you can't go to every meeting unless you're a journalist or event organizer (or sales rep!), so I had to stay home and get work done.

I suspect I'm hooked & will be repeating this exercise whenever I miss good conferences. Who knows? Maybe I'll catch the Twitter bug yet!

Thursday, October 22, 2009

Physical Maps IV: Twilight of the Clones?

I've been completely slacking on completing my self-imposed series on how second generation sequencing (I'm finally trying to kick the "next gen" term) might reshape the physical mapping of genomes. It hasn't been that my brain has been ignoring the topic, but somehow I've not extracted the thoughts through my fingertips. And I've figured out part of the reason for my reticence -- my next installment was supposed to cover BACs and other clone-based maps, and I'm increasingly thinking these aren't going to be around much longer.

Amongst the many ideas I turned over was how to adapt BACs to the second generation world. BACs are very large segments -- often a few hundred kilobases -- cloned into low copy (generally single copy) vectors in E.coli.

One approach would be to simply sequence the BACs. One key challenge is that a single BAC is poorly matched to a second generation sequencer; even a single lane of a sequencer is gross overkill. So good high-throughput multiplex library methods are needed. Even so, there will be a pretty constant tax of resequencing the BAC vector and the inevitable contaminating host DNA in the prep. That's probably going to run about 10% wastage -- not unbearable but certainly not pretty.

Another type of approach is end-sequencing. For this you really need long reads, so 454 is probably the only second generation machine suitable. But, you need to smash down the BAC clone to something suitable for emulsion PCR. I did see something in Biotechniques on a vectorette PCR to accomplish this, so it may be a semi-solved problem.

A complementary approach is to landmark the BACs, that is to identify a set of distinctive features which can be used to determine which BACs overlap. At the Providence conference one of the posters discussed getting 454 reads from defined restriction sites within a BAC.

But, any of these approaches still require picking the individual BACs and prepping DNA from them and performing these reactions. While converting to 454 might reduce the bill for the sequence generation, all that picking & prepping is still going to be expensive.

BACs' baby cousins are fosmids, which are essentially the same vector concept but designed to be packaged into lambda phage. Fosmids carry approximately 40Kb of DNA. I've already seen ads from Roche/454 claiming that their 20Kb mate pair libraries obviate the need for fosmids. While 20Kb is only half the span, many issues that fosmids solve are short enough to be fixed by a 20Kb span, and the 454 approach enables getting lots of them.

This is all well and good, but perhaps it's time to look just a little bit further ahead. Third generation technologies are getting close to reality (those who have early access Pacific Biosciences machines might claim they are reality). Some of the nanopore systems detailed in Rhode Island are clearly far away from being able to generate sequences you would believe. However, physical mapping is a much less demanding application than trying to generate a consensus sequence or identify variants. Plenty of times in my early career it was possible using BLAST to take amazingly awful EST sequences and successfully map them against known cDNAs.

Now, I don't have any inside information on any third generation systems. But, I'm pretty sure I saw a claim that Pacific Biosciences has gotten reads close to 20Kb. Now, this could have been a "magic read" where all the stars were aligned. But imagine for a moment if this technology can routinely hit such lengths (or even longer) -- albeit with quality that makes it unusable for true sequencing but sufficient for aligning to islands of sequence in a genome assembly. If such a technology could generate sufficient numbers of such reads in reasonable time, the 454 20Kb paired libraries could start looking like buggy whips.

Taking this logic even further, suppose one of the nanopore technologies could really scan very long DNAs, perhaps 100Kb or more. Perhaps the quality is terrible, but again, as long as it's just good enough. For example, suppose the error rate was 15%, or a phred 8 score. AWFUL! But, in a sequence of 10,000 (standing for the size of a fair-sized sequence island in an assembly) you'd expect to find nearly 3 runs of 50 correct bases. Clearly some clever algorithmics would be required (especially since with nanopores you don't know which direction the DNA is traversing the pore), but this would suggest that some pretty rotten sequencing could be used to order sequence islands along long reads.
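For anyone who wants to check that back-of-envelope figure, here is a minimal sketch. It assumes errors are independent from base to base (surely not true of any real instrument) and counts every start position whose next 50 bases are all correct, so overlapping windows are counted separately. The function names are my own invention for illustration.

```python
import math


def phred(error_rate):
    """Phred-scaled quality for a per-base error probability."""
    return -10 * math.log10(error_rate)


def expected_perfect_windows(length, window, error_rate):
    """Expected number of start positions whose next `window` bases are all
    correct, assuming each base is wrong independently with `error_rate`."""
    p_window_correct = (1 - error_rate) ** window
    n_start_positions = length - window + 1
    return n_start_positions * p_window_correct


quality = phred(0.15)
windows = expected_perfect_windows(length=10_000, window=50, error_rate=0.15)

print(f"phred score for 15% error: {quality:.1f}")
print(f"expected error-free 50-base windows in 10,000 bases: {windows:.2f}")
```

Under those assumptions the count lands just under 3, and the number is very sensitive to the error rate: rerunning with a different `error_rate` or `window` shows how quickly the perfect stretches appear or vanish.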

Yet another variant on this line of thinking would be to use nanopores to read defined sequence landmarks from very long fragments. Once you have an initial assembly, a set of unique sequences can be selected for synthesis on microarrays. While PCR is required to amplify those oligos, it also offers an opportunity to subdivide the huge pool. Furthermore, with sufficiently long oligos on the chip one could even have multiple universal primer targets per oligo, enabling a given landmark to be easily placed in multiple orthogonal pools. With an optical nanopore reading strategy, 4 or more color-coded pools could be hybridized simultaneously and read. Multiple colors might be used for more elaborate coding of sequence islands -- i.e. one island might be encoded with a series of flashing lights, much like some lighthouses. Again, clever algorithmics would be needed to design such probe strategies.
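To make the lighthouse idea a little more concrete, here is a toy sketch of handing out color sequences to sequence islands. Everything in it is hypothetical: the four single-letter color labels, the island names and the function names are made up, and no real probe chemistry is modeled. The one design choice it illustrates is orientation: since the direction of travel through the pore is unknown, a code and its mirror image can't both be used, and I've also dropped palindromic codes because they can't tell you which way the molecule was read.

```python
from itertools import product

# Hypothetical labels for four distinguishable fluorophore colors.
COLORS = ("R", "G", "B", "Y")


def orientation_safe_codes(code_length):
    """Return color sequences that stay unambiguous when read backwards.

    Only one member of each forward/reverse pair is kept, and palindromes
    are dropped because they read identically in both directions.
    """
    safe = []
    for code in product(COLORS, repeat=code_length):
        if code < code[::-1]:
            safe.append(code)
    return safe


def assign_codes(islands, code_length):
    """Give each island its own orientation-safe color sequence."""
    codes = orientation_safe_codes(code_length)
    if len(islands) > len(codes):
        raise ValueError("not enough codes; increase code_length")
    return dict(zip(islands, codes))


assignment = assign_codes(["island_1", "island_2", "island_3"], code_length=3)
for island, code in assignment.items():
    print(island, "".join(code))
```

With four colors and three flashes there are 64 raw sequences but only 24 that survive the orientation filter, so longer codes (or more colors) would be needed for any genome with a serious number of islands.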

How far away would such ideas be? Someone more knowledgeable about the particular technologies could guess better than I could. But, it would certainly be worth exploring, at least on paper, for anyone wanting to show that nanopores are close to prime time. While really low quality reads or just landmarking molecules might not seem exciting, it would offer a chance to get the technology into routine operation -- and from such routine operation comes continuous improvement. In other words, the way to push nanopores into routine sequencing might be by carefully picking something other than sequence -- but making sure that it is a path to sequencing and not a detour.

Wednesday, October 14, 2009

Why I'm Not Crazy About The Term "Exome Sequencing"

I find myself worrying sometimes that I worry too much about the words I use -- and worry some of the rest of the time that I don't worry enough. What can seem like the right words at one time might seem wrong some other time. The term "killer app" is thrown around a lot in the tech space, but would you really want to hear it used about sequencing a genome if you were the patient whose DNA was under scrutiny?

One term that sees a lot of traction these days is "exome sequencing". I listened in on a free Science magazine webinar today on the topic, and the presentations were all worthwhile. The focus was on the Nimblegen capture technology (Roche/Nimblegen/454 sponsored the webinar), though other technologies were touched on.

By "exome sequencing" what is generally meant is to capture & sequence the exons in the human genome in order to find variants of interest. Exons have the advantage of being much more interpretable than non-coding sequences; we have some degree of theory (though quite incomplete) which enables prioritizing these variants. The approach also has the advantage of being significantly cheaper at the moment than whole genome sequencing (one speaker estimated $20K per exome). So what's the problem?

My concern is that the term "exome sequencing" is taken a bit too literally. Now, it is true that these approaches catch a bit of surrounding DNA due to library construction and the targeting approaches cover splice junctions, but what about some of the other important sequences? According to my poll of practitioners of this art, their targets are entirely exons (confession: N=1 for the poll).

I don't have a general theory for analyzing non-coding variants, but conversely there are quite a few well annotated non-coding regions of functional significance. An obvious case is promoters. Annotation of human promoters and enhancers and other transcriptional doodads is an ongoing process, but some have been well characterized. In particular, the promoters for many drug metabolizing enzymes have been scrutinized because these may have significant effects on how much of the enzyme is synthesized and therefore on drug metabolism.

Partly coloring my concern is the fact that exome sequencing kits are becoming standardized; at least two are on the market currently. Hence, the design shortcomings of today might influence a lot of studies. Clearly sequencing every last candidate promoter or enhancer would tend to defeat the advantages of exome sequencing, but I believe a reasonable shortlist of important elements could be rapidly identified.

My own professional interest area, cancer genomics, adds some additional twists. At least one major cancer genome effort (at the Broad) is using exome sequencing. On the one hand, it is true that there are relatively few recurrent, focused non-coding alterations documented in cancer. However, few is not none. For example, in lung cancer the c-Met oncogene has been documented to be activated by mutations within an intron; these mutations cause skipping of an exon encoding an inhibitory domain. Some of these alterations are about 50 nucleotides away from the nearest splice junction -- a distance that is likely to result in low or no coverage using the Broad's in solution capture technology (confession #2: I haven't verified this with data from that system).

The drug metabolizing enzyme promoters I mentioned before are a bit greyer for cancer genomics. On the one hand, one is generally primarily interested in what somatic mutations have occurred in the tumor. On the other hand, the norm in cancer genomics is tending towards applying the same approach to normal (cheek swab or lymphocyte) DNA from the patient, and why not get the DME promoters too? After all, these variants may have influenced the activity of therapeutic agents or even development of the disease. Just as some somatic mutations seem to cluster enigmatically with patient characteristics, perhaps some somatic mutations will correlate with germline variants which contributed to disease initiation.

Whatever my worries, they should be time-limited. Exome sequencing products will be under extreme pricing pressure from whole genome sequencing. The $20K cited (probably using 454 sequencing) is already potentially matched by one vendor (Complete Genomics). Now, in general the cost of capture will probably be a relatively small contributor compared to the cost of data generation, so exome sequencing will ride much of the same cost curve as the rest of the industry. But, it probably is $1-3K for whole exome capture due to the multiple chips required and the labor investment (anyone have a better estimate?). If whole mammalian genome sequencing really can be pushed down into the $5K range, then mammalian exome sequencing will not offer a huge cost advantage if any. I'd guess interest in mammalian exome sequencing will peak in a year or two, so maybe I should stop worrying and learn to love the hyb.

Friday, October 09, 2009

Bad blog! Bad, bad, bad blog!

Thanks to Dan Koboldt from Mass Genomics, I've discovered that another blog (the Oregon Personal Injury Law Blog) had copied my breast cancer genome piece. Actually, it appears that since it started this summer it may have copied every one of my posts here at Omics! Omics! without any attribution or apparent linking back. I've left a comment (which is moderated) protesting this.

Curiously, the author of this blog (I assume it has one) doesn't seem to have left any identifying information or contact info, so for the moment the comments section is my only way of communicating. Perhaps this is some sort of weird RSS-driven bug; that's the only charitable explanation I can contemplate. But it is strange -- most of these have no possible link to personal injury -- or can PNAS sue me for complaining about their RSS feed?

We'll see if the author fixes this, or at least replies with something along the lines of "head down, ears flat & tail between the legs".

Just to double-check the RSS hypothesis, I'm actually going to explicitly sign this one -- Keith Robison from Omics! Omics!.

Nano Anglerfish Snag Orphan Enzymes

The new Science has an extremely impressive paper tackling the problem of orphan enzymes. Due primarily to Watson-Crick basepairing, our ability to sequence nucleic acids has shot far past our ability to characterize the proteins they may encode. If I want to measure an RNA's expression, I can generate an assay almost overnight by designing specific real-time PCR (aka RT-PCR aka TaqMan) probes. If I want to analyze any specific protein's expression, it generally involves a lot of teeth gnashing & frustration. If you're lucky, there is a good antibody for it -- but most times there is either no antibody or one of unknown (and probably poor) character. Mass spec based methods continue to improve, but still don't have an "analyze any protein in any biological sample anytime" character (yet?).

One result of this is that there are a lot of ORFs of unknown function in any sequenced genome. Bioinformatic approaches can make guesses for many of these and those guesses are often around enzymatic activity, but a bioinformatic prediction is not proof and the predictions are often quite vague (such as "hydrolase"). Structural genomics efforts sometimes pull in additional proteins whose sequence didn't resemble anything of known function, but whose structure has enzymatic characteristics such as nucleotide binding pockets. There have been one or two of such structures de-orphaned by virtual screening, but these are a rarity.

Attempts have been made at high-throughput screening of enzyme activities. For example, several efforts have been published in which cloned libraries of proteins from a proteome were screened for enzyme activity. While these produced initial papers, they've never seemed to really catch fire.

The new paper is audacious in providing an approach to detecting enzyme activities and subsequently identifying the responsible proteins, all from protein extracts. The key trick is an array of golden nano anglerfish -- well, that's how I imagine it. Like an anglerfish, the gold nanoparticles dangle their chemical baits off long spacers (poly-A, of all things!). In reverse of an anglerfish, the bait complex glows after it has been taken by its prey, with a clever unquenching mechanism activating the fluorophore and marking that a reaction took place. But the real kicker is that like an anglerfish, the nanoparticles seize their prey! Some clever chemistry around a bound cobalt ion (which I won't claim to understand) results in linking the enzyme to the nanoparticle, from which it can be cleaved, trypsinized and identified by mass spectrometry. 1676 known metabolites and 807 other compounds of interest were immobilized in this fashion.

As one test, the researchers applied separately extracts of the bacteria Pseudomonas putida and Streptomyces coelicolor to arrays. Results were in quite strong agreement with the existing bioinformatic annotations of these organisms, in that the P.putida extract's pattern of metabolized and not metabolized substrates strongly coincided with what the informatics would predict and the same was true for S.coelicolor (with a P<5.77^-177 for the latter!). But, agreement was not perfect -- each species catalyzed additional reactions on the array which were absent from the databases. By identifying the bound proteins, numerous assignments were made which were either novel or significant refinements of the prior annotation. Out of 191 proteins identified in the P.putida set, 31 hypothetical proteins were assigned function, 47 proteins were assigned a different function and the previously ascribed function was confirmed for the remaining 113 proteins.

Further work was done with environmental samples. However, given the low protein abundance from such samples, these were converted into libraries cloned into E.coli and then the extracts from these E.coli strains analyzed. Untransformed E.coli was used to estimate the backgrounds to subtract -- I must confess a certain disappointment that the paper doesn't report any novel activities for E.coli, though it isn't clear that they checked for them (but how could you not!). The samples came from three extreme environments -- one from a hot, heavy metal rich acidic pool, one from oil-contaminated seawater and a third from a deep sea hypersaline anoxic region. From each sample a plethora of enzyme activities were discovered.

Of course, there are limits to this approach. The tethering mechanism may interfere with some enzymes acting on their substrates. It may, therefore, be desirable to place some compounds multiple times on the array but with the linker attached at different points. It is unlikely we know all possible metabolites (particularly for strange bugs from strange places), so some enzymes can't be deorphaned this way. And sensitivity issues may challenge finding some enzyme activities if very few copies of the enzyme are present.

On the other hand, as long as these issues are kept in mind this is an unprecedented & amazing haul of enzyme annotations. Application of this method to industrially important fungi & yeasts is another important area, and certainly only the bare surface of the bacterial world was scratched in this paper. Arrays with additional unnatural -- but industrially interesting -- substrates are hinted at in the paper. Finally, given the reawakened interest in small molecule metabolism in higher organisms & their diseases (such as cancer), application of this method to human samples can't be far behind.
Ana Beloqui, María-Eugenia Guazzaroni, Florencio Pazos, José M. Vieites, Marta Godoy, Olga V. Golyshina, Tatyana N. Chernikova, Agnes Waliczek, Rafael Silva-Rocha, Yamal Al-ramahi, Violetta La Cono, Carmen Mendez, José A. Salas, Roberto Solano, Michail M. Yakimov, Kenneth N. Timmis, Peter N. Golyshin, & Manuel Ferrer (2009). Reactome array: Forging a link between metabolome and genome. Science, 326 (5950), 252-257. DOI: 10.1126/science.1174094

Wednesday, October 07, 2009

The genomic history of a breast cancer revealed

Today's Nature contains a great paper which is one more step forward for cancer genomics. Using Illumina sequencing a group in British Columbia sequenced both the genome and transcriptome of a metastatic lobular (estrogen receptor positive) breast cancer. Furthermore, they searched a sample of the original tumor for mutations found in the genome+transcriptome screen in order to identify those that may have been present early vs. those which were acquired later.

From the combined genome sequence and RNA-Seq data they found 1456 non-synonymous changes, which were then trimmed to 1178 after removing pseudogenes and HLA sequences. 1120 of these could be re-assayed by Sanger sequencing of PCR amplicons from both normal DNA and the metastatic samples -- 437 of these were confirmed. Most of these (405) were found in the normal sample. Of the 32 remaining, 2 were found only in the RNA-Seq data, a point to be addressed later below. Strikingly, none of the mutated genes were found in the previous whole-exome sequencing (by PCR+Sanger) of breast cancer, though those samples were of a different subtype (estrogen receptor negative).

There are a bunch of cool tidbits in the paper, which I'm sure I won't do full justice to here but I'll do my best. For example, several other papers using RNA-Seq on solid cancers have identified fusion proteins, but in this paper none of the fusion genes suggested by the original sequencing came through their validation process. Most of the coding regions with non-synonymous mutations have not been seen to be mutated before in breast cancer, though ERBB2 (HER2, the target of Herceptin) is in the list along with PALB2, a gene which when mutated predisposes individuals to several cancers (and is also associated with BRCA2). The algorithm (SNVMix) used for SNP identification & frequency estimation is a good example of an easter egg, a supplementary item that could easily be its own paper.

One great little story is HAUS3. This was found to have a truncating stop codon mutation and the data suggests that the mutation is homozygous (but at normal copy number) in the tumor. A further screen of 192 additional breast cancers (112 lobular and 80 ductal) for several of the mutations found no copies of the same hits seen in this sample, but two more truncating mutations in HAUS3 were found (along with 3 more variations in ERBB2 within the kinase domain, a hotspot for cancer mutations). HAUS3 is particularly interesting because until about a year ago it was just C4orf15, an anonymous ORF on chromosome 4. Several papers have recently described a complex ("augmin") which plays a role in genome stability, and HAUS3 is a component of this complex. This starts smelling like a tumor suppressor (truncating mutations seen repeatedly; truncating mutation homozygous in tumor; protein involved in a function often crippled in cancer), and I'll bet HAUS3 will be showing up in some functional studies in the not too distant future.

Resequencing of the primary tumor was performed using amplicons targeting the mutations found in the metastatic tumor. These amplicons were small enough to be spanned directly by paired-end Illumina reads, obviating the need for library construction (a trick which has shown up in some other papers). By using Illumina sequencing for this step, the frequency of the mutation in the sample could be estimated. It is also worth noting that the primary tumor sample was a Formalin Fixed Paraffin Embedded slide, a way to preserve histology which is notoriously harsh on biomolecules and leaves samples prone to sequencing artifacts. Appropriate precautions were taken, such as sequencing two different PCR amplifications from two different DNA extractions. The sequencing of the primary tumor suggests that only 10 of the mutations were present there, with only 4 of these showing a frequency consistent with being present in the primary clone and the others probably being minor components. This is another important filter to suggest which genes are candidates for being involved in early tumorigenesis and which are more likely late players (or simply passengers).
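
To make the frequency idea concrete, here is a minimal sketch of how one might estimate a mutation's frequency from deep amplicon reads: count the reads carrying the variant and divide by the depth, with a confidence interval to show how well the depth pins it down. This is not the paper's method (they have their own statistical machinery); the function names and the example read counts are made up for illustration, and the interval is a standard Wilson score interval.

```python
from math import sqrt


def allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a position that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def wilson_interval(variant_reads: int, total_reads: int, z: float = 1.96):
    """Wilson score interval for the allele frequency (z=1.96 is ~95%)."""
    p = allele_frequency(variant_reads, total_reads)
    n = total_reads
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical counts: 50 variant reads out of 1000 covering the site.
    low, high = wilson_interval(50, 1000)
    print(f"frequency {allele_frequency(50, 1000):.3f}, "
          f"95% interval {low:.3f} to {high:.3f}")
```

A variant sitting near the frequency expected for a heterozygous mutation in the dominant clone looks very different from one at a few percent, which is the sort of distinction behind the "4 of 10" split above. With FFPE material the raw counts also carry artifact noise, so replicate amplifications matter more than any single interval.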

One more cool bit I parked above: the 2 variants seen only in the RNA-Seq library. This suggested RNA editing; consistent with this, an RNA editase (ADAR) was found to be highly represented in the RNA-Seq data. Two genes (COG3 and SRP9) showed high frequency editing. RNA editing is beginning to be recognized as a widespread phenomenon in mammals (e.g. the nice work by Jin Billy Li in the Church lab); the possibility that cancers can hijack this for nefarious purposes should be an interesting avenue to explore. COG3 is a Golgi protein & links of the Golgi to cancer are starting to be teased out. SRP9 is part of the signal recognition particle involved in protein translocation into the ER -- which of course feeds the Golgi. Quite possibly this is coincidental, but it certainly rates investigating.

One final thought: the next year will probably be filled with a lot of similar papers. Cancer genomics is gearing up in a huge way, with Wash U alone planning 150 genomes well before a year from now. It seems unlikely that those 150 genomes will end up as 150 distinct papers; moreover, it will be a challenge to do the level of follow-up in this paper on such a grand scale. A real challenge to the experimental community -- and the funding establishment -- is converting the tantalizing observations which will come pouring out of these studies into validated biological findings. With a little luck, biotech & pharma companies (such as my employer) will be able to convert those findings into new clinical options for doctors and patients.
Sohrab P. Shah, Ryan D. Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh, Angela Burleigh, Allen Delaney, Karen Gelmon, Ryan Guliany, Janine Senz, Christian Steidl, Robert A. Holt, Steven Jones, Mark Sun, Gillian Leung, Richard Moore, Tesa Severson, Greg A. Taylor, Andrew E. Teschendorff, Kane Tse, Gulisa Turashvili, Richard Varhol, René L. Warren, Peter Watson, Yongjun Zhao, Carlos Caldas, David Huntsman, Martin Hirst, Marco A. Marra, & Samuel Aparicio (2009). Mutational evolution in a lobular breast tumor profiled at single nucleotide resolution. Nature, 461, 809-813. doi:10.1038/nature08489

Tuesday, October 06, 2009

Diagramming the Atari Pathway

Okay, it was an outside speaker at work who planted this seed in my brain, and now I can't shake the image -- but perhaps by writing this I will (but also perhaps I will infect my loyal readers with it).

The stated observation was that some biological pathway diagrams "look like Space Invaders". Now, I hold such games dear to my heart -- they were quite the rage in our neighborhood growing up, though we didn't own one & I was never very good. Nowadays one can buy replicas which play many of the old games -- except the entire system fits inside the replica of the old joysticks. My hardware-oriented brother loves to point out all the interesting workarounds which are now fossilized in these players -- such as limits on the number of moving graphics ("sprites") which could occupy a scan line.

But which video game seems to be the model for some of these diagrams? Space Invaders is an obvious candidate (or one of the knockoffs or follow-ons such as Galaga), but my old favorite Centipede (or its successor Millipede) is even closer -- they even had spiders trying to spin webs.

It would be a pretty funny visual joke -- saved for precisely the right time (the wrong time could be disaster!) -- to have a pathway display morph into a game. The transcription factors start moving about and crashing into the kinases which in turn blast away at the receptors.

Versions of the reverse have sometimes occupied my mind -- what if we could make scientific programs more game-like? The notion I most commonly ponder is a flight simulator for protein structures. Even that could be taken to another level -- your X-wing is flying down a canyon of the giant structure, ready to unleash a boronic warhead to destroy the evil proteasomic death star!

Why does PNAS clip their RSS feeds?

Okay, minor pet peeve. I've pretty much switched over to using Outlook as an RSS reader to keep up with journals of interest. I still get a few ToC by email, but the RSS mechanism has lots of advantages. First, I'm in Outlook all the time, so it's a natural place. Second, I can leave behind copies of the papers of interest, with all the tools in Outlook for moving them or tagging them & such. One minor annoyance is you can't (at least as far as I can tell) force a scan of the RSS feeds. Sure, mostly this is obsessive or time-killing, but when you have intermittent net access it's really handy.

But one big difference in the feeds. Most ToC feeds send out one entry per article and that entry contains the title, authors & abstract. But PNAS sends out only the authors, title & a very short head end of the abstract. Aaaaarrrrrgggghhhh! Lost is much of the ability to vet my level of interest in an article plus the additional keywords which would enhance searching for it.

I realize PNAS is already busy with torquing their acceptance channels, but could someone who knows someone there in power please get them to fix this?!!?

Thursday, October 01, 2009

Pondering Polonators

Standing next to the Polonator like a proud relative is Kevin McCarthy, who leads the Polonator effort at Dover Systems. I had remembered him giving permission to photograph it at the first day of the Providence meeting & brought my camera along the second day. When I mentioned it was for my blog, Kevin leaped into the frame. All in good fun!

The Polonator is an intriguing gadget. No other next-gen sequencer can be had for under $200K, which is about 1/2 to 1/4 the price of any of the other instruments. But it's no tinfoil-and-paperclip contraption -- not only does it look very solid & professional, with everything laid out neatly in the cabinet, but in one small test it was quite robust. Kevin had it running mock sequencing cycles and he said "if you put your hand on the stage". I thought he was being hypothetical, but then he politely insisted I do just that. Clearly he wasn't worried about anything going wrong (and somehow I was convinced my hand would emerge unscathed!). In his talk, Kevin pointed out the various vibration isolation schemes engineered in -- you need not tiptoe past it during operation, despite the fact that it is doing some amazingly high-precision imaging.

The truly intriguing angle on the Polonator is that it is a completely open architecture. If you want to play around with different chemistries, go ahead (but please respect appropriate licenses!). I'm guessing you could probably run any of the existing amplification-based chemistries on it (again, licenses might be an issue) -- presumably with a loss of performance. Of course, with 454 you need continuous watching of a small bit of the flowcell, so the machine isn't ideal. But that isn't the point -- you could use this as a general hardware & software chassis to experiment. I speculated previously that some new sequencing-by-synthesis chemistries could be run on the Polonator, and on further reflection I'm wondering if the optical-based nanopore scheme could be prototyped on a Polonator. NAR published earlier this year another proposed chemistry that would seem Polonator-friendly.

If you wish to reprogram the fluidics, go ahead! If you wish to image only in 1 color (the default chemistry requires 4), that's programmable. Everything is programmable.

That's pretty enticing from a techie angle, but it's also a pretty risky business strategy. Generally such an expensive gadget is either paid for with a hefty markup up front and/or a hefty premium on reagents. But, while standard reagent kits are on the way, there's nothing proprietary about them. Anyone can whip up their own. Just like the hardware & software, the wetware is all open as well.

There's also the issue of the current chemistry, which appears to be the original Church lab sequencing-by-ligation scheme. That means a bunch of sample prep steps and very short reads -- 26 nucleotides of tag. The tags are derived from the original sequence in a predictable way, but the result isn't quite like getting two simple paired-end or mate-pair reads. That may be a barrier to many software toolsmiths including Polonator in their code, though perhaps with wide acceptance that would happen. But, with 10Gbases of data after 80 hours of running, it may attract some attention!
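
Some back-of-the-envelope arithmetic on those figures, as a rough sketch: the run size, run time and tag length are the numbers quoted above, while the genome size in the example is my own round-number assumption. The coverage figure is raw (it assumes every base is usable and mappable, which won't be true with 26-nucleotide tags).

```python
RUN_BASES = 10e9    # bases per run, as quoted above
RUN_HOURS = 80      # hours per run, as quoted above
TAG_LENGTH = 26     # nucleotides of tag per read, as quoted above


def bases_per_hour() -> float:
    """Average raw throughput over a run."""
    return RUN_BASES / RUN_HOURS


def tags_per_run() -> float:
    """Approximate number of 26-nt tags behind a run's worth of bases."""
    return RUN_BASES / TAG_LENGTH


def fold_coverage(genome_size_bases: float) -> float:
    """Raw fold coverage of a genome, ignoring unmappable or low-quality reads."""
    return RUN_BASES / genome_size_bases


if __name__ == "__main__":
    print(f"{bases_per_hour() / 1e6:.0f} Mbases per hour")
    print(f"{tags_per_run() / 1e6:.0f} million tags per run")
    # Assumed genome size: a round 5 Mbases for a typical bacterium.
    print(f"{fold_coverage(5e6):.0f}x raw coverage of a 5 Mbase genome")
```

So a single run is enormous overkill for one bacterial genome, which means multiplexing many samples per run would be the natural way to keep the machine busy.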

I'm trying to figure out how I would use one if I had one. In the abstract sense, polony sequencing has already been shown quite capable of sequencing bacterial genomes. Also, Complete Genomics' chemistry generates reads in the same ballpark and they are tackling human. But would I have the courage to try that? Certainly in my current professional situation it would be going out a bit on a limb. Plus, even at under $200K it really needs to be kept busy to look like a good buy. Does almost make me wish I was back in graduate school, as that is the time to experiment with such cool toys!

On the other hand, I do have some notions of what I might try out on one. Not enough notions to be able to justify buying one, but certainly if I could rent some time on one at a reasonable price I'd jump at the chance. With luck, a service provider or two will decide to offer Polonating as a service. Or, perhaps someone who has bought one might be interested in collaborating on some interesting clinically-relevant projects? If so, leave me a comment here (which I won't make visible) & we can talk!