I'm going to attempt to synthesize a number of thoughts which I've long pondered along with a bunch of news items I came across today. With luck, the result will be coherent and I'll not make a fool of myself.
There was a very interesting article last week in the New York Times on a serious ethical dilemma in melanoma and how different specialists in the field are voicing opinions on both sides of the divide. Even better, today I came across an excellent blog post reviewing that article which also added a lot of expert background. I'll summarize the two very quickly.
Metastatic melanoma is an awful diagnosis; the disease is very aggressive. Furthermore, the standard-of-care chemotherapy drug is a very ugly cytotoxic, with nasty side effects and very poor efficacy (more on that later). Sequencing studies have revealed that well over half of metastatic melanomas have a mutant form of the kinase B-RAF (gene: BRAF), most commonly the mutation V600E (which, alas, due to an error in the original reference sequence was for a while known as V599E). That's the substitution of an acidic residue (glutamate) for a hydrophobic one (valine), right in the kinase active site.
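Side note for the computationally inclined: shorthand like V600E is regular enough to parse mechanically, and the V599E/V600E episode is a reminder that the position in such a string is only as good as the reference numbering behind it. A minimal sketch of my own (handling only simple single-residue substitutions, nothing fancier):

```python
import re

def parse_protein_mutation(mut):
    """Split a protein substitution like 'V600E' into its parts.

    Returns (reference_residue, position, variant_residue).
    Handles only simple single-letter substitutions, not
    insertions, deletions, or frameshifts.
    """
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut)
    if m is None:
        raise ValueError(f"not a simple substitution: {mut}")
    return m.group(1), int(m.group(2)), m.group(3)

# BRAF V600E: valine (V) at residue 600 replaced by glutamate (E)
print(parse_protein_mutation("V600E"))  # → ('V', 600, 'E')
```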
Now, a biotech called Plexxikon, in conjunction with Roche, has developed an inhibitor of B-RAF called PLX4032. In Phase I trial results reported this summer in the New England Journal of Medicine, very promising tumor regressions were seen. Now remember, this was a single-arm Phase I trial for safety, meaning we don't have an objective comparison to make.
And there begins the rub. To some doctors (and many patients), the combination of great preclinical results, the theoretical and experimental underpinnings for targeting B-RAF in melanoma, and the observed regressions means we have a winner on our hands, and it is now unethical to run a randomized trial comparing the new compound against the standard of care.
At the other pole are doctors who worry that we have been fooled before.
My standard example to trot out for such cases is the famous CAST cardiovascular trial, to which a placebo arm was grudgingly added -- a sound theory had been advanced that suppressing arrhythmias in certain patients would prevent death. CAST was stopped early when it was clear the placebo arm fared far better; the toxicities of the drugs overwhelmed any benefits. Even closer to our current story is the drug sorafenib, which was originally developed as a B-RAF antagonist. Now, there are many in the field who argued that it really wasn't, but Bayer and Onyx got it to market (probably based on its inhibition of numerous other kinases), and the "raf" syllable in the generic name points to their belief in the B-RAF theory. Unfortunately, in randomized clinical trials it failed to work in V600E melanomas.
One idea that was apparently floated by at least one oncologist working in the trials, but rejected by the corporate sponsors, was to try to win approval based on the nearly miraculous recoveries seen in some patients on death's door. What the NYT article failed to discuss is whether the FDA would buy that argument; there are many reasons to think they wouldn't -- they really do not like single-arm trials, because all too often spurious results occur due to random chance (or, rarely, to manipulation of the trial).
An important idea discussed in all this is the concept that once we have established a therapy as efficacious, it is generally unethical to withhold that therapy from patients. But, we are often not on such solid ground even in this area. Clinical trials represent a horrible case of multiple testing; more than a few drugs that squeaked through their trial would not if you ran the trial again; they just got lucky. Don't believe me? Think back to Iressa, which received accelerated approval for lung cancer and then had it withdrawn (only to later be reintroduced). We now know a key piece of that particular puzzle: Iressa works in patients whose tumors have mutant forms of the EGFR. The first trial, by chance, was enriched for such patients and the second trial (also by chance) was not as enriched. Given that the EGFR hypothesis wasn't known, neither trial could have been manipulated.
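The multiple-testing problem is easy to make concrete with a toy simulation: run enough trials of a drug with zero true effect against a 5% significance threshold, and a handful will "succeed" purely by luck. Everything in the sketch below -- the response rates, arm sizes, and the crude one-sided z-test -- is invented for illustration, not a real analysis plan:

```python
import random

def run_trial(true_effect, n_per_arm=100):
    """Simulate one two-arm trial; return True if it 'succeeds'.

    Toy model: each patient either responds (1) or not (0).
    Control response rate is 30%; the drug adds `true_effect`.
    'Success' is a crude one-sided z-test at the 5% level on the
    difference in response rates -- purely illustrative.
    """
    control = sum(random.random() < 0.30 for _ in range(n_per_arm))
    drug = sum(random.random() < 0.30 + true_effect for _ in range(n_per_arm))
    p1, p2 = control / n_per_arm, drug / n_per_arm
    pooled = (control + drug) / (2 * n_per_arm)
    se = (2 * pooled * (1 - pooled) / n_per_arm) ** 0.5
    if se == 0:
        return False
    return (p2 - p1) / se > 1.645  # one-sided 5% threshold

random.seed(1)
# 200 useless drugs (no true effect): some still 'win' by luck
false_wins = sum(run_trial(true_effect=0.0) for _ in range(200))
print(f"{false_wins} of 200 ineffective drugs passed their trial")
```

Run the same trial twice and a lucky winner may well fail the rerun, which is exactly the Iressa-style scenario.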
But another recent item, covered in a different post on the same blog, reminds us that even well-established clinical approaches may not hold true over time. Screening mammography is a hot-potato issue in cancer: can you save lives by screening healthy women for breast cancer? Various studies have tried to ask this question not just for women overall, but by age group, since the incidence of breast cancer and the quality of mammograms change with patient age. The newest fuel on this fire is a very clever Norwegian study, which I won't attempt to summarize, that suggests that much (but perhaps not all) of the benefit of screening mammography has been eroded by improvements in cancer care. In other words, the advantage of early detection has been blunted by better treatments. Now, I'm not qualified to really review that study, but certainly this is a concept we should keep in mind: the utility of medical strategies may change over time, and not always for the better.
In my mail tonight was a thick, magazine-sized volume from Scientific American, to which I confess I am not a subscriber (it's a fine magazine; I just already subscribe to too many fine magazines). This special edition, titled "Pathways: The changing science, business & experience of health", focuses on healthcare with a mix of articles. Some appear to be written by professional writers, while others are thinly veiled advertisements for various companies.
In scanning the table of contents, I was caught by "Pioneering Personalized Cancer Care", though unfortunately this turns out to be one of the puffier pieces. Written by two principals in the company, it mostly describes N-of-one, a company whose customers are cancer patients. N-of-one tries to distill the available knowledge on a person's tumor and help them navigate to the most appropriate tests. It's a business model I've sometimes wondered about for myself, since playing an oncologic Sherlock Holmes could be both fascinating and rewarding. On the other hand, the regulatory environment is fraught with uncertainty, and most likely this sort of organization will have to rely on wealthy customers willing to pay their own way.
Now, the article did set my teeth on edge early on with the statement "Recently, projects such as the Cancer Genome Atlas have documented thousands of mutations in cancer cells that can lead to unregulated cell growth and prevent apoptosis (cell death), the hallmarks of malignancy". Any regular reader of this space knows that I am a gung-ho proponent of sequencing tumors, but with that comes an obligation to be honest. And the honest truth is that sequencing has yielded thousands of candidates, but only a handful of those have actually been shown to have transforming ability -- there's just no high-throughput way to do that en masse.
What N-of-one and others are doing is where I strongly believe the future of oncology lies. But it will be a complicated place. Getting back to B-RAF, I've heard noise that it has been found mutated in a number of additional tumor types, albeit at low frequency. So, suppose it occurs at a 1-in-1,000 frequency in some awful tumor type. With routine whole-genome sequencing of tumors, we could detect that. Such sequencing is starting to be used to good effect, as reported recently in Nature. That leads to a conundrum for everyone. For a patient or clinician, do you go with PLX4032, given that we know it targets BRAF -- but knowing that we don't know whether BRAF is really driving your tumor (especially if the mutation is not V600E)? For those wanting to design clinical trials, could you really find enough patients to stock a trial -- or are you willing to have a trial with "any cancer, as long as it has a BRAF mutation"?
This is the challenge that personalized medicine presents us. With genome sequencing (and eventually also routine whole methylome profiling), we can find what makes cancers different -- but how will we ever actually sort through all those differences? Should we move away from randomized trials to going where the science seems to lead us, even knowing that more than a few times there have been dead ends?
I can find only one easy answer to all this: don't trust anyone who offers an easy answer to all this.
Tuesday, September 28, 2010
Tuesday, September 21, 2010
Review: The $1000 Genome
Kevin Davies' "The $1000 Genome" deserves to be widely read. Readers of this space will not be surprised that there are a few changes I might have imposed had I been its editor, but on the whole it presents a careful and I think entertaining view of the past and possible future of personal genomics.
The book is intended for a far wider audience than geeky genomics bloggers, so the emphasis is not on the science. Rather, it is on some of the key movers-and-shakers in the field and some of the companies which have been dominating this space, ranging from the first personal genetic mapping companies (23andMe, Navigenics, Pathway Genomics and deCODEme) to the instrument makers (such as Solexa/Illumina, Helicos, Pacific Biosciences, ABI and Oxford Nanopore) to those working on various aspects of human genome sequencing services (such as Knome and Complete Genomics). Various ups and downs of these companies -- and the debates they have engendered -- are covered, as well as the possible impacts on society. Along the way, we see a few glimpses of Davies exploring his own genome and some of the biological history which he seeks to enlighten through these expeditions.
It is not a trivial task to try to explain this field to an educated lay public, but I think in general Davies does a good job. The overviews of the technologies are limited but give the gist of things. Anyone writing in this space is faced with the dilemma of trying to explain too much and losing the main thread or failing to explain and preventing the reader from finding it. Mostly I think he has succeeded in threading this needle, perhaps because only rarely did I feel he had missed. One example I did note was in explaining PacBio's technology; hardly anyone in science will know what a zeptoliter is, let alone someone outside of it. On the other hand, what analogy or refactoring of that term could remove it from the edges of science fiction? Not an easy challenge!
For better or worse, once I've decided I generally like a book like this, my next thoughts are what could be removed and what could be added. I really could find little to remove. But there are a few things I wish had either been expanded or had made it in at all.
It would be dreary to enumerate every company which has ever thrown its hat in the DNA sequencing ring. It is valuable that Davies covers a few of the abject failures, such as Manteia (which did yield some key technology to Illumina when sold for assets) and US Genomics. There is scant coverage, other than by mention, of most of the companies which have made but nascent attempts to enter the arena. However, the one story I really did miss was anything about the Polonator. It's not that I really think this system will conquer the others (though perhaps I hope it will hold its own); it just represents a very different tack in corporate strategy that would have been interesting to contrast with the other players.
Davies has been in the thick of the field as editor of Bio IT World, so this is no stitching together of secondary sources. I also appreciated that he includes both the ups and the downs for these companies, emphasizing that this has not been easy for any of them. But that added to my surprise at several incidents which were left out (believe me, many were left in that I had never heard before). Davies describes how Helicos delivered an instrument to the CRO Expression Analysis, but not that it was very publicly returned for failing to perform to spec. Nor is Helicos' failed attempt to sell itself mentioned. An interesting anecdote on Complete Genomics is how a wildfire nearly disrupted one of their first human genome runs; left out is the near-death experience of that company when it was forced to either lay off or defer the salaries of nearly all of its staff. The section on Complete's founder Rade Drmanac mentioned Hyseq, but not the company (or was it two?) which he ran between Hyseq and Complete to try to commercialize sequencing-by-hybridization. This would have added to the portrait of determination -- and the travails of the corporate arena. I was also surprised that the short profile of Sydney Brenner as a personal genomics skeptic didn't include the fact that he invented the technology behind Lynx, which was another early attempt at non-electrophoretic sequencing. Some would see that as irony.
Another area I would like to have seen expanded was the exploration of groups such as Patients Like Me, which are windows on how much people are willing to risk by disclosing sensitive medical information. One section explores the fact that several prominent persons interested in this field became so when their children were diagnosed with rare recessive disorders, leading them to ponder whether they would have made the same marriage had they known of this danger in advance. I was surprised that little of the existing experience in this area was explored; I believe the Ashkenazi population has dealt with this in screening for Tay-Sachs and other horrific disorders which are prevalent there.
The book is stunningly up-to-date for something published at the beginning of September; some incidents as late as June are reported. Despite this, I found little evidence of haste. I'm still trying to figure out what a "nature capitalist" is, but that's the only case I spotted of a likely mis-wording.
Davies explores possible uses of these sequencing technologies beyond our germline sequences, but only very briefly. Personally, I think that cancer genomics will have a more immediate and perhaps greater overall impact on human medicine, and I wish it had gotten a bit more in-depth treatment.
Davies is an expatriate Brit, living not very far from me. The sections on the possible impact of widespread genome sequencing on medicine are written almost entirely from a U.S. perspective, with our hybrid public-private healthcare system. I suspect European readers would hunger for more discussion of how personal genomics might be handled within their socialized medical systems and different histories of handling the ethical issues (Germany, I believe, has pretty much banned personal genomics services). On this side of the pond, he does a nice job of showing how different state agencies have charged into the breach left, until recently, by the FDA.
Okay, too many quibbles. Well, maybe one last one -- it would have been nice to see more on some of the academic bioinformaticians who have created such wonderful and amazing open-source tools as Bowtie and BWA.
As I mentioned above, Davies injects a good amount of himself into all this. I've encountered books (indeed, one recently on moon walkers) in which this becomes a tedious over-exposure to the author's ego. This is not such a book. The personal bits either link pieces of the story or make them more approachable. We find out that he has already attained a greater age than his father did (his father died of testicular cancer, one of the few cancers in which overwhelming progress has since been made), leading to questions he hopes his genome can answer. Hence his trying out of pretty much all of the array-based personal genetic services. But he does not address one question that the book raised in my mind: will the royalties from this project fund a complete Davies genome?
Saturday, September 11, 2010
ARID1A A Fertile Ground for Mutations in Ovarian Clear Cell Carcinoma
"Although ovarian clear cell carcinoma does not respond well to conventional platinum–taxane chemotherapy for ovarian carcinoma, this remains the adjuvant treatment of choice, because effective alternatives have not been identified."
This sentence is a depressing reminder of the status of medical treatment of far too many tumor types. Present in roughly 12% of U.S. ovarian cancer cases, ovarian clear cell carcinoma (OCCC) is a dreadful diagnosis.
Two papers this week made a significant step forward in understanding the molecular basis -- and heterogeneity -- of this horror. Seemingly the finale of an old-fashioned race to publish, groups centered at the British Columbia Cancer Agency (in the New England Journal of Medicine) and Johns Hopkins University (in Science) published papers with the same headline finding: inactivating mutations in the chromatin-regulating gene ARID1A (whose gene product is known as BAF250) are a key step in many -- but not all -- OCCCs. I'll use the shorthand Vancouver and Baltimore to refer to the respective groups.
Both papers got here via the largest applications of second-generation sequencing to cancer published so far. The Vancouver work relied on transcriptome sequencing (RNA-Seq) of a discovery cohort of 18 patients; the Baltimore group used hybridization-targeted exome sequencing on just 8 patients. Both used Illumina paired-end sequencing for the discovery phase; Vancouver also used the same platform for validation on a larger cohort.
Whole-genome sequencing is likely the future for cancer genomics. A non-cancer paper just published 20 genomes in one shot, underscoring how this is becoming routine with easy samples, and a work which is apparently in press (I have no inside knowledge; it has been discussed at several public meetings) will have perhaps a dozen human genomes in it. But there are still cost advantages to focusing on expressed genic regions (and perhaps a bit more), and perhaps further information to be gleaned from actually looking at gene expression. These two papers give an opportunity, albeit a bit constrained, to compare the two approaches.
One interesting note comes straight out of the Vancouver data. After finding ARID1A mutations in 6/18 discovery samples, they re-screened those samples plus 211 additional samples. In total this set included 1 OCCC cell line, 119 OCCCs, 33 endometrioid carcinomas and 76 high-grade serous carcinomas. The validation screen was by long-range PCR (mean product size 2,067 bp), with products sheared and sequenced on the Illumina. One exon proved troublesome and required further PCR and Sanger sequencing. In any case, the key bit here is that in the discovery cohort this approach found ARID1A mutations which had been missed by the original RNA-Seq. As the authors state, a likely culprit is nonsense-mediated decay (NMD). It would be interesting to go into their dataset to see if these samples had markedly lower expression of ARID1A, though I don't have easy access to it (it has been deposited, but with protections that should be the subject of a future post).
One interesting contrast between the two studies is the haul of genes. The Vancouver group found ARID1A as a recurrently mutated gene; the Hopkins group not only bagged ARID1A but also KRAS, PIK3CA and PPP2R1A. KRAS and PIK3CA are well-known oncogenes in multiple tumor types and had previously been implicated in OCCC, but PPP2R1A is a novel find. The Vancouver group did specifically search for KRAS and PIK3CA mutants in their cohorts by PCR assays and found one patient sample and one cell line with KRAS mutations. Again, it would be interesting to review the RNA-Seq data to generate hypotheses as to why these were not found in the Vancouver set. On the other hand, the RNA-Seq data did identify one case of a rearranged ARID1A. While it is possible to use hybridization capture to identify gene fusions, this cannot be practically done in a hypothesis-free manner; in other words, without advance interest in ARID1A that approach would not work. In addition, CTNNB1 (beta-catenin) mutations had been found previously in OCCC and were specifically checked (and found) by the Vancouver group, but none were reported by the Baltimore group. One final small discrepancy: both groups looked at the cell line TOV21G for their mutations of interest and both found the same activating KRAS and PIK3CA alleles. However, Vancouver found one ARID1A allele while Baltimore found that one and a second (actually, the two mutations I am calling the same [1645insC and 1650dupC] aren't described precisely the same way, though I'm guessing it is a difference in an ambiguous alignment).
One other surprise is that TP53 (p53) and PTEN mutants had apparently been reported either for OCCC or endometriosis-associated tumors, yet neither group reported any.
An analysis that is not explicitly found in either paper but which I feel is valuable is to look at the co-occurrence of these mutations. If we look only at patient samples, then the big take-home is that neither group saw co-occurrence of KRAS and ARID1A (the TOV21G cell line is at odds with this conclusion). Mutually exclusive mutations have been seen in many tumors; for example, KRAS mutations are generally mutually exclusive with other mutations in the RTK-RAS-RAF-MAPK pathway. In contrast, ARID1A mutations are found in conjunction with mutations in CTNNB1, PIK3CA and PPP2R1A -- one patient sample in the Baltimore data was even triple mutant for ARID1A, PIK3CA and PPP2R1A. About 30-40% of samples are mutated for none of these genes, as far as this data can tell; the hunt for further causes will continue. Will they be epigenetic? Mutations in regulatory elements?
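For anyone wanting to play with such patterns, tabulating pairwise co-occurrence from per-sample mutation calls takes only a few lines. The sample data here is entirely made up for illustration; it is not the actual cohort data from either paper:

```python
from itertools import combinations

# Toy mutation calls per sample -- illustrative only, NOT the
# actual data from the Vancouver or Baltimore cohorts.
mutations = {
    "S1": {"ARID1A", "PIK3CA"},
    "S2": {"ARID1A", "CTNNB1"},
    "S3": {"KRAS"},
    "S4": {"ARID1A", "PIK3CA", "PPP2R1A"},
    "S5": set(),
    "S6": {"KRAS"},
}

def cooccurrence(mutations, genes):
    """Count how often each gene pair is mutated in the same sample."""
    counts = {pair: 0 for pair in combinations(sorted(genes), 2)}
    for sample_genes in mutations.values():
        for a, b in counts:
            if a in sample_genes and b in sample_genes:
                counts[(a, b)] += 1
    return counts

genes = ["ARID1A", "KRAS", "PIK3CA", "PPP2R1A", "CTNNB1"]
for (a, b), n in sorted(cooccurrence(mutations, genes).items()):
    print(f"{a} + {b}: {n} sample(s)")
```

In this toy data, ARID1A + KRAS stays at zero while ARID1A + PIK3CA co-occurs, mirroring the pattern described above.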
Another interesting comparison is simply the number of mutations per sample. The Hopkins exome data typically has very small numbers of mutations (after filtering out germline variants): as few as 13 in a sample and as many as 125 -- and the high number was from a tumor which had previously been treated with DNA-damaging agents (all of the other tumors in the Hopkins study were treatment-naive). In contrast, the Vancouver data often showed more than 1,000 non-synonymous variants per tumor. Unfortunately, no clinical history is available for the Vancouver cohort, so we don't know whether this reflects DNA-damaging therapeutics or differences in the sequencing or variant filtering. In an ideal world, we could filter each dataset with the other group's filtering scheme to see how much of an effect that would have.
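That last wish can at least be sketched: per-sample mutation counts are the output of a filtering pipeline, and swapping filters changes the haul. The field names below (sample, in_matched_normal, in_dbsnp, effect) are invented for illustration -- real pipelines differ, and differing stringency in exactly such filters is one way the Vancouver and Baltimore counts could diverge:

```python
def somatic_nonsynonymous(variants):
    """Count candidate somatic non-synonymous variants per sample.

    `variants` is a list of dicts with invented field names --
    this is a sketch of the filtering idea, not either group's
    actual pipeline.
    """
    counts = {}
    for v in variants:
        if v["in_matched_normal"]:   # present in normal: germline
            continue
        if v["in_dbsnp"]:            # likely common polymorphism
            continue
        if v["effect"] == "synonymous":
            continue
        counts[v["sample"]] = counts.get(v["sample"], 0) + 1
    return counts

calls = [
    {"sample": "T1", "in_matched_normal": False, "in_dbsnp": False, "effect": "missense"},
    {"sample": "T1", "in_matched_normal": True,  "in_dbsnp": False, "effect": "missense"},
    {"sample": "T1", "in_matched_normal": False, "in_dbsnp": False, "effect": "synonymous"},
    {"sample": "T2", "in_matched_normal": False, "in_dbsnp": True,  "effect": "nonsense"},
]
print(somatic_nonsynonymous(calls))  # → {'T1': 1}
```

Loosen or drop any one of those filters and the counts balloon, which is why cross-applying the two groups' schemes would be so informative.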
The Vancouver group went beyond sequencing to examine samples by immunohistochemistry (IHC) for expression of the ARID1A gene product, BAF250. There is a strong, but imperfect, negative correlation between mutations and BAF250 expression. Some mutated but BAF250-expressing samples may be explained by the target of the antibody; the truncated forms may still express the correct epitope. Alternatively, ovarian cells may be very sensitive to the dosage of this gene product (in some samples both wt and mutant alleles were clearly found in the RNA-Seq data). Also of interest will be samples lacking expression but unmutated; these may be the places to identify further mechanisms for tumors to eliminate BAF250 expression.
The Vancouver study illustrates one additional bonus from RNA-Seq data: a list (in the supplemental data) of genes differentially expressed between ARID1A mutant and ARID1A wild-type cells.
Another interesting bit from the Vancouver paper is the look at two cases in which the tumor was adjacent to endometrial tissue. In one of these, the same truncating mutation was found in the adjacent lesion and the tumor -- but not in a distant endometriosis. Hence, the mutation was not driving the endometriosis but occurred afterwards.
I'm sure I'm short-shrifting further details; there's a lot of data packed in these two reports. But what will it all mean for ovarian cancer patients? Alas, none of the genes save PIK3CA are obvious druggable targets. PIK3CA encodes the alpha isoform of PI3 kinase, a target many companies are working on -- but that wasn't novel to these papers. PPP2R1A is a regulatory subunit of a protein phosphatase, and the mutations are concentrated on a single amino acid, suggesting these are activating mutations (as seen with ARID1A, inactivating mutations can sprawl all over a gene). Phosphatases have not been a productive source of drugs in the past, but perhaps that can change in the future. Chromatin regulation is a hot topic, but ARID1A is deficient here, not overactive. Given that tumors can apparently live with two mutated copies, the idea of further inactivating complexes containing mutant ARID1A is probably not a profitable one. But perhaps there is a yin-yang relationship with another chromatin regulator which can be leveraged; in other words, perhaps inhibiting an opposing complex could restore balance to the cell's chromatin regulation and inhibit the tumor. That's the sort of work which can build off the foundation these two cancer genomics papers have provided.

Kimberly C. Wiegand, Sohrab P. Shah, Osama M. Al-Agha, Yongjun Zhao, Kane Tse, Thomas Zeng, Janine Senz, Melissa K. McConechy, Michael S. Anglesio, Steve E. Kalloger, Winnie Yang, Alireza Heravi-Moussavi, Ryan Giuliany, Christine Chow, John Fee, et al. (2010). ARID1A Mutations in Endometriosis-Associated Ovarian Carcinomas. New England Journal of Medicine. DOI: 10.1056/NEJMoa1008433
Jones S, Wang TL, Shih IM, Mao TL, Nakayama K, Roden R, Glas R, Slamon D, Diaz LA Jr, Vogelstein B, Kinzler KW, Velculescu VE, & Papadopoulos N (2010). Frequent Mutations of Chromatin Remodeling Gene ARID1A in Ovarian Clear Cell Carcinoma. Science (New York, N.Y.) PMID: 20826764
Tuesday, August 31, 2010
Worse Could Be Better
My eldest brother was in town recently on business &, in our many discussions, reminded me of the thought-provoking essay "The Rise of 'Worse is Better'". It runs on a thought train similar to Clayton Christensen's books -- sometimes really elegant technologies are undermined by ones which are initially far less elegant. In the "WiB" case, the more elegant system is too good for its own good, and never gets off the ground. In Christensen's "disruptive technology" scenarios, the initially inferior technology serves utterly new markets which the more elegant approaches have priced themselves out of, then nibbles slowly but surely until it replaces the dominant one. But a key conceptual requirement is to evaluate the new technology on the dimensions of the new markets, not the existing ones.
I'd argue that anyone trying to develop new sequencing technologies would be well advised to ponder these notions, even if they ultimately reject them. The newer and more different the technology, the longer they should ponder. For it is my argument that there are indeed markets to be served other than $1K high-quality canid genomes, and some of those offer opportunities. Even existing players should think about this, as there may be interesting trade-offs that could open up totally new markets.
For example, I have an RNA-Seq experiment off at a vendor. In the quoting process, it became pretty clear that about 50% of my costs are going to the sequencing run and the other 50% of costs to library preparation (of course, within both of those are buried various other costs such as facilities & equipment as well as profit, but those aren't broken out). As I've mentioned before, the costs of the sequencing are plummeting but library construction is not on such a steep trend.
So, what if you had a technology that could do away with library construction? Helicos simplified it greatly, but for cDNA still required reverse transcription with some sort of oligo library (oligo-dT, random primers or a carefully picked cocktail to discourage rRNA from getting in). What if you could either get rid of that step, read the sequence during reverse transcription or not even reverse transcribe at all? A fertile imagination could suggest a PacBio-like system with reverse transcriptase immobilized instead of DNA polymerase. Some of the nanopore systems theoretically could read the original RNA directly.
Now, if the cost came down a lot I'd be willing to give up a lot of accuracy. Maybe you couldn't read out mutations or allele-specific transcription, but suppose expression profiles could be had for tens of dollars a sample rather than hundreds? That might be a big market.
Another play might be to trade read length or quality of an existing platform for more reads. For example, Ion Torrent is projected to initially offer ~1M reads of modal length 150 for $500 a pop. For expression profiling, that's not ideal -- you really want many more reads but don't need them so long. Suppose Ion Torrent's next quadrupling of features came at a cost of shorter reads and lower accuracy. For the sequencing market that would be disastrous -- but for expression profiling that might be getting in the ballpark. Perhaps a chip with 16X the features of the initial one -- but with only 35bp reads -- could help drive adoption of the platform by supplanting microarrays for many profiling experiments.
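For profiling, it's the read (tag) count that matters, not the base count, so the trade works out nicely. A quick back-of-envelope -- the 1M x 150bp figures are the projected launch specs cited above, while the 16X chip and 35bp reads are purely my speculation:

```python
# Arithmetic behind the hypothetical expression-profiling chip trade-off.
# 1M reads x 150 bp = projected initial Ion Torrent specs (per the post);
# the 16x-feature chip with 35 bp reads is entirely speculative.
initial_reads, initial_len = 1_000_000, 150
chip_factor, short_len = 16, 35

profiling_reads = initial_reads * chip_factor  # 16M reads: the figure tag profiling cares about
initial_bases = initial_reads * initial_len    # 150 Mb on the initial chip
profiling_bases = profiling_reads * short_len  # 560 Mb, despite the much shorter reads

print(f"{profiling_reads:,} reads of {short_len} bp ({profiling_bases/1e6:.0f} Mb)")
```

Sixteen million tags per run is comfortably in the range where digital counts start to rival microarray sensitivity for most transcripts.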
One last wild idea. The PacBio system has been demonstrated in a fascinating mode they call "strobe sequencing". The gist is that the read length on PacBio is largely limited by photodamage to the polymerase, so letting the polymerase run for a while in the dark enables spacing reads apart by distances known within some statistical limits. There's been noise about this going at least 20Kb and perhaps much longer. How long? Again, if you're trapped in "how many bases can I generate for cost X", then giving up a lot of features for such long strobe runs might not make sense. But suppose you really could get 1/100th the number of reads (300) -- but strobed out over 100Kb (with a ~150bp island every 10Kb). I.e., get 5X the fragment size by giving up about 99% of the sequence data. 100 such runs would be around $10K -- and would give a 30,000-fragment physical map with markers spaced about every 10Kb along fragments of 100Kb. For a mammalian genome that works out to roughly 1X span coverage; even allowing for some loss due to unmappable islands, a long-range map like that would not be shabby at all!
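The strobe-map numbers above are easy to check; every input here is my speculative figure from the paragraph, not a PacBio spec:

```python
# Back-of-envelope for the speculative strobe-sequencing physical map.
# All inputs are the post's hypothetical figures, not vendor specs.
reads_per_run = 300         # ~1/100th of a run's normal read count
runs = 100
cost_per_run = 100          # dollars per run
span_bp = 100_000           # each strobe read stretches over ~100 kb
island_spacing_bp = 10_000  # one ~150 bp sequenced island every 10 kb

fragments = reads_per_run * runs                      # 30,000 mapped fragments
islands = fragments * (span_bp // island_spacing_bp)  # ~300,000 mappable markers
total_cost = runs * cost_per_run                      # ~$10,000 all told

print(f"{fragments:,} fragments, {islands:,} islands, ${total_cost:,}")
```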
Now, I won't claim anyone is going to make a mint off this -- but with serious proposals to sequence 10K vertebrate genomes, such high-throughput physical mapping could be really useful and not a tiny business.
Sunday, August 29, 2010
Who has the lead in the $1K genome race?
A former colleague and friend has asked over on a LinkedIn group for speculation on which sequencing platform will deliver a $1K 30X human genome (reagent cost only). It is somewhat unfortunate that this is the benchmark, given the very real cost of sample prep (not to mention other real costs such as data processing), but it has tended to be the metric of most focus.
Of existing platforms, there are two which are potentially close to this arbimagical goal (that is, a goal which is arbitrary yet has obtained a luster of magic through repetition).
ABI's SOLiD 4 platform can supposedly generate a genome for $6K, though even with pricing from academic core labs I can't actually buy that for less than about $12K (commercial providers will run quite a bit more; they have the twin nasty issues of "equipment amortization" and "solvency" to deal with).
The SOLiD 4 hq upgrade is promised for this fall with a $3K/genome target. Could Life Tech squeeze that out? I'm guessing the answer is yes, as the hq does not use an optimal bead packing. Furthermore, the new paired end reagents will offer 75 bp reads in one direction but only 25 in the other.
I've never understood why a ligation chemistry should have an asymmetry to it (though perhaps it lies in the cleavage step), so perhaps there is significant room for improvement there. Of course, those possible 40 cycles are not free, so whether this would help with cost/genome is not obvious (though it would be advantageous for many other reasons). Though, since they can currently get a 30X genome on one slide, longer reads would enable packing more genomes per slide -- & perhaps that's where the accounting ends up favoring longer reads.
Complete Genomics is the other possible player, but we have an even murkier lens on the reagent costs per genome, given that Complete deals only in complete genomes and only in bulk. But they do have to actually ensure they are not losing money (or at least, with their IPO they won't be able to hide the bleed). Indeed, Kevin Davies (who has a book on $1K genomes coming out) replied on the thread that Complete Genomics has already declared itself to be at $1K/genome in reagent costs. Perhaps we should move the target to something else (Miss Amanda suggests that $1K canid genomes are far more interesting).
What about Illumina? They are supposedly at $10K/genome with the HiSeq, and many have noted that the initial HiSeq specs were for a lower cluster packing than many genome centers achieve. That also brings up an interesting issue of consistency -- how variable are cluster packings, and therefore the output per run? In other words, what sigma are we willing to accept in our $1K/genome estimate? Also, the HiSeq specs were for shorter reads than the 2 x 150 paired-end reads that are quite common in 1000 Genomes depositions in the SRA (how much longer can Illumina go?).
So, perhaps any of these three existing platforms might meet the mark (454 is a non-starter; piling up data cheaply is not its sweet spot). What about the ones in the wings? Of course, these are even murkier, and we must rely even more on their makers' projections (and potentially, wishful thinking).
Ion Torrent's technology (to be re-branded by Life Tech?) isn't nearly there right now. For $500 (the claim is) you'd get 150Mb of data, or about 0.1X coverage for $1000, so we need about a 300X improvement. However, there should be a lot of opportunity to improve. The dimension touted most in the past is feature density; Ion Torrent was apparently already working on a chip with about 4X the number of features. If we round 300 down to 256, that would be only 4 rounds of quadrupling. If Life could pump those out every 6 months, then it would be only two years to a $1K genome. Who knows how realistic that schedule would be?
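The arithmetic above is simple enough to write down explicitly; all inputs are the projected (not measured) launch figures quoted in this post:

```python
import math

# The Ion Torrent back-of-envelope: how far the projected launch specs
# ($500 per run, ~150 Mb per run) sit from a $1K 30x human genome.
cost_per_run = 500        # dollars (projected)
run_yield_bp = 150e6      # ~150 Mb per run (~1M reads x ~150 bp)
genome_bp = 3e9           # human genome
target_depth = 30

depth_per_1k = (1000 / cost_per_run) * run_yield_bp / genome_bp  # ~0.1x per $1,000
shortfall = target_depth / depth_per_1k                          # ~300-fold gap
# Round 300 down toward 256, as in the text: four 4x density bumps.
quadruplings = round(math.log(shortfall, 4))
years = quadruplings * 0.5  # if a denser chip ships every six months

print(f"~{shortfall:.0f}-fold to go: {quadruplings} quadruplings, ~{years:.0f} years")
```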
But Ion Torrent could push on other dimensions as well. Because the chip itself is a huge chunk of the cost of a run, squeezing out longer read lengths would pay off directly. Since 454 gets nearly 500 base pair reads routinely (and up to a kilobase when things are really humming), perhaps there is a factor of nearly 4 to be had from longer reads. In a similar manner, a paired-end protocol could potentially double the amount of sequence per chip (at a cost of perhaps a bit more than double the runtime; not such a big deal if the run is really an hour). Could that be done? I think I have the schematic for an approach (which might also work on 454); trade proposals for sequencing instruments will be put to my employer for consideration! Finally, as noted in a thread on SEQanswers, Ion Torrent is apparently achieving only about 1/8th efficiency in converting chip features to sequence-generating sites; better loading schemes might squeeze out another few fold. So perhaps Ion Torrent really is 1-2 years away from $1K genomes (much more likely the 2).
Moving on, could Pacific Biosciences (or the Life Tech StarLight (née VisiGen)) technology have a shot? Lumping them together (since we have virtually no price/performance information for StarLight), PacBio is initially promising $100 runs generating ~60Mb, so $1K would get you about 0.2X coverage. That's about 150-fold off, which we'll round to 128-fold, or 7 doublings. They've already been said to be testing a chip with twice the density, plus a better loading scheme worth around another 2X -- so perhaps it's only 5 doublings.
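The same sort of arithmetic, run for PacBio's cited initial pricing; again, the run cost and yield are the projections quoted above, not measured numbers:

```python
import math

# PacBio's distance from a $1K 30x genome, from the cited initial specs:
# $100 runs generating ~60 Mb.
cost_per_run = 100
run_yield_bp = 60e6
genome_bp = 3e9
target_depth = 30

depth_per_1k = (1000 / cost_per_run) * run_yield_bp / genome_bp  # ~0.2x per $1,000
shortfall = target_depth / depth_per_1k                          # ~150-fold gap
doublings = round(math.log2(shortfall))                          # ~7 doublings
# A 2x-density chip plus ~2x better loading, both reportedly in testing,
# would knock roughly 4x off the gap -- about five doublings remaining.
remaining = round(math.log2(shortfall / 4))

print(f"{doublings} doublings from launch specs, ~{remaining} after the 4x in testing")
```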
Finally, there are the technologies which haven't yet demonstrated the ability to read any DNA, but could do so and then move quickly (or not). In this category are any nanopore-based systems (which is a dizzying array of approaches) and Gnu Bio's sequencing-by-synthesis-in-nanodrops approach. And perhaps a few more. These don't even work yet, so even speculative price performance information isn't available.
Finally, a quick note about what a $1K genome means. The X-prize folks have set very stringent standards -- far beyond what any short-read technology could hope to accomplish, and also far beyond what many sequencing applications need. The organizers did not set the bar so high without reason; there are applications which need that rigor, and it will greatly cut down on false positives. But, as the regular stream of papers shows, much lower standards will suffice to get interesting biology out of whole human genomes.
Tuesday, August 24, 2010
Lawyers v. Research Funding?
An ongoing personal quest is to attempt to fill in the gaps in my original education, particularly outside the sciences, where I feel there exist gaping chasms. Through Wikipedia, books and especially recorded college courses, I slowly patch up what my education (or, all too commonly, my youthful lapses of attention during that education) failed to cover. I'm currently making a third pass through a wonderful course on Roman history, since I enjoyed it very much the first two times.
During Rome's early expansion it was ruled by rotating sets of elected officials under a system known to us as the Roman Republic. A series of events (known to scholars as the Roman Revolution) over many decades disrupted this system, culminating in the replacement of the Republic with the military dictatorship of the Emperors, which would remain until the fall of the empire. The Revolution was set in motion by an official named Tiberius Gracchus, who, in the service of high-minded ideals (rewarding landless soldiers with their own plots on which to support themselves), changed the nature of Roman politics by introducing mob violence to the process (as well as a certain degree of ruthlessness in dealing with the opposition of colleagues).
I fear that yesterday's court decision regarding embryonic stem cell research represents a similar horrible turn. Now, what most commentators will focus on is the very issue of creating human embryonic stem cells and whether the government should finance this. This is an area in which the proponents of both sides of the issue have deeply and sincerely held beliefs which I feel must be respected, though in the end they are fundamentally irreconcilable. But peripheral to that, the case represents a very scary intrusion of lawyers into the research funding process.
One of the claims made by the plaintiffs (in particular, the researcher James Sherley) is that the new guidelines on what embryonic stem cell research can be funded represent a very real cause of harm to those working on adult stem cell research: they will face more competition for research funding. That is certainly true, if we view research funding for stem cells as a zero-sum game (and that is another whole can of balled waxworms I won't deal with). The danger now is that every possible change in federal (or even private?) funding aims will be an opportunity for litigators to intrude. Wind down project X to fund project Y? LAWSUIT! Either this will dissuade funders from the ebb and flow which is necessary, or a far worse than zero-sum game ensues in which money for science instead funds litigation (or the buy-offs of potential suits which are routine in that field).
Can this genie be stuffed back in the bottle? I'm not legally trained enough to know. Perhaps it was inevitable. Perhaps we need Congress to explicitly forbid it (but would that be legal?) -- and what are the chances of that? Has a terrible Rubicon been crossed? I hope I am wrong in thinking it has.
Saturday, August 21, 2010
Varus! Where are my legions (of data)!?!?
Bring up the subject of outsourcing, and many minds will immediately jump to the idea of a company using outside services to more cheaply replace operations formerly conducted in house. But the other side of the topic is what I frequently experience: outsourcing allows me to access technologies and capabilities which I simply could not afford on my own, or at least to try very expensive technologies prior to investing in them. This is very useful, but has its own issues.
I've now gotten data from 4 different large outsourced sequencing projects. Rated on a five star system, they would (in order) be rated less than expected (**), complete failure (*), less than expected (**) and greater than expected (****). Samples for two more projects just shipped out last week. Given that we don't have any sort of sequencer in house (one project above was conventional Sanger) nor can we willy-nilly buy any specialized hardware for target enrichment (two projects involved enrichment), this has been valuable -- though I really wish I could have rated all as greater than expected (or at least one off the charts).
After the quality of the delivered data, my next greatest frustration is with knowing when that data will be delivered. Now, a few projects (plus some explicit vendor tests not included in the above) have gone on schedule, but the utter failure had the pain compounded by being grossly overdue (1-3 months, depending on exactly how you define the start point), and one of the other projects came in a week overdue.
But even worse than being late is not knowing how late until the data shows up. Partly this revolves around trying to appropriately budget my time, but it also affects transmitting expectations to others awaiting the results.
In an ideal world, I'd have a real-time portal onto the vendor's LIMS -- one cancer model outfit claimed exactly this. But in any case, I'd really like to have regular updates as to the progress of my project -- and especially to what's happening if the vendor has gone into troubleshooting mode.
After all, what these outfits wish to claim is that they will act as an extension of my organization. Now, if the work were going on in house & I was concerned about progress, I could easily pop in and chat with the person(s) working on it. I'm not interested in hanging over someone's shoulder & making them nervous, but I do like to try to at least understand what is going on & what approaches are being used to solve whatever has gone wrong. Unfortunately, in several outsourcing projects this is specifically what was lacking -- no concrete estimate of a schedule, nor any regular communication when projects were overdue.
In a basic sense, I'd like an update every time my project crosses a significant threshold. Now, the exact definition of that is tricky. But, imagine a typical hybridization capture targeted sequencing. The vendor receives my DNA, shears, size selects, ligates adapters, amplifies and has a library. Some QC happens at various stages. Then there is the hybridization, recovery and further amplification. At some point the platform-specific upstream-of-sequencer step occurs (cluster formation or ePCR). Then it goes on the sequencer. Each cycle of sequencing occurs, plus (for Illumina) cluster regeneration and paired end sequencing. Then downstream basecalling (if not in line). Once basecalls are done, then whatever steps occur to get me the data. And that's just the workflow when all goes correctly; throw in some troubleshooting for problems should they occur.
Now, ideally I could see all of those steps. But how? I really don't want an email after every sequencer cycle. Could something like Twitter be adapted for this purpose?
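To sketch how lightweight such a feed could be: a handful of coarse milestones rather than per-cycle spam. The step names below follow the capture workflow I just described; the project ID and the function itself are invented purely for illustration:

```python
# A hypothetical, Twitter-sized milestone feed for an outsourced sequencing
# project. Step names follow the hybridization-capture workflow sketched
# above; project IDs and the notify function are invented for illustration.
MILESTONES = [
    "sample received",
    "library constructed & QC passed",
    "capture hybridization & re-amplification done",
    "clusters formed / ePCR done",
    "sequencing complete",
    "basecalling complete",
    "data delivered",
]

def notify(project_id: str, step: str, note: str = "") -> str:
    """Return one short status line per milestone crossed."""
    msg = f"[{project_id}] {step}" + (f" -- {note}" if note else "")
    # Keep it tweet-sized so a customer can skim a whole project at a glance.
    return msg[:140]

print(notify("PRJ-0042", MILESTONES[0]))
```

Seven messages per project (plus an extra one whenever troubleshooting starts) would be vastly more informative than the silence I usually get.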
Happily, in the recent experience which prompted the title of this post, the data did finally come in (after some hiccups with delivery) and was quite exciting. So I'm not tearing my lab coat like Augustus. But when vendors try to solicit my business, or when I'm rating the experience afterwards, the transparency and granularity of their communication will be a critical consideration. Vendors who are reading this, take note!
I've now gotten data from 4 different large outsourced sequencing projects. Rated on a five star system, they would (in order) be rated less than expected (**), complete failure (*), less than expected (**) and greater than expected (****). Samples for two more projects just shipped out last week. Given that we don't have any sort of sequencer in house (one project above was conventional Sanger) nor can we willy-nilly buy any specialized hardware for target enrichment (two projects involved enrichment), this has been valuable -- though I really wish I could have been able to rate all as greater than expected (or at least one off the charts).
After the quality of the delivered data, my next greatest frustration is with knowing when that data will be delivered. Now, a few projects (plus some explicit vendor tests not included in the above) have gone on schedule, but the utter failure had its pain compounded by being grossly overdue (1-3 months, depending on exactly how you define the start point), and one of the other projects came in a week overdue.
But even worse than being late is not knowing how late until the data shows up. Partly this revolves around trying to appropriately budget my time, but it also affects transmitting expectations to others awaiting the results.
In an ideal world, I'd have a real-time portal onto the vendor's LIMS -- one cancer model outfit claimed exactly this. But in any case, I'd really like to have regular updates as to the progress of my project -- and especially to what's happening if the vendor has gone into troubleshooting mode.
After all, what these outfits wish to claim is that they will act as an extension of my organization. Now, if the work were going on in house & I was concerned about progress, I could easily pop in and chat with the person(s) working on it. I'm not interested in hanging over someone's shoulder & making them nervous, but I do like to try to at least understand what is going on & what approaches are being used to solve any problems. Unfortunately, in several outsourcing projects this is specifically what was lacking -- no concrete estimate of a schedule nor any regular communication when projects were overdue.
In a basic sense, I'd like an update every time my project crosses a significant threshold. Now, the exact definition of that is tricky. But, imagine a typical hybridization capture targeted sequencing. The vendor receives my DNA, shears, size selects, ligates adapters, amplifies and has a library. Some QC happens at various stages. Then there is the hybridization, recovery and further amplification. At some point the platform-specific upstream-of-sequencer step occurs (cluster formation or ePCR). Then it goes on the sequencer. Each cycle of sequencing occurs, plus (for Illumina) cluster regeneration and paired end sequencing. Then downstream basecalling (if not in line). Once basecalls are done, then whatever steps occur to get me the data. And that's all the correct workflow: throw in some troubleshooting for problems should they occur.
Now, ideally I could see all of those steps. But how? I really don't want an email after every sequencer cycle. Could something like Twitter be adapted for this purpose?
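The milestone idea above can be sketched as a tiny notifier that passes through coarse-grained events and ignores per-cycle noise. This is a hypothetical illustration, assuming a vendor exposes some kind of event feed; every name below (the milestone list, `notify`, the project ID) is invented, not any vendor's actual API.

```python
# Hypothetical sketch: milestone-level status updates for an outsourced
# sequencing project. Only coarse-grained thresholds trigger an update;
# fine-grained events (e.g. individual sequencer cycles) are dropped.

MILESTONES = [
    "sample received",
    "library QC passed",
    "hybridization capture complete",
    "cluster generation complete",
    "sequencing run complete",
    "basecalling complete",
    "data delivered",
]

def notify(project_id, event, log):
    """Record one milestone update; skip anything not on the milestone list."""
    if event not in MILESTONES:
        return  # too granular -- ignore per-cycle noise
    log.append(f"{project_id}: {event}")

log = []
notify("PRJ-001", "sample received", log)
notify("PRJ-001", "cycle 37 of 100 complete", log)  # ignored: not a milestone
notify("PRJ-001", "data delivered", log)
```

Something this simple, pushed over e-mail digests or (as mused above) a Twitter-like feed, would already beat radio silence.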
Happily, in the recent experience that inspired the title of this post, the data did finally come in (after some hiccups with delivery) and was quite exciting. So I'm not tearing my lab coat like Augustus. But when vendors try to solicit my business, or when I'm rating the experience afterwards, the transparency and granularity of their communication will be a critical consideration. Vendors who are reading this, take note!
Tuesday, August 17, 2010
Life Tech Gobbles Ion Torrent
Tonight's big news is that Life Technologies, the giant formed by the merger of ABI and Invitrogen, has acquired Ion Torrent for an eye-popping $375M (mixed cash & stock), with another $325M possible in milestones and such.
I'm not shocked Ion Torrent was shopping itself; by linking with an established player Ion Torrent gains access to marketing channels -- an area in which they have displayed a serious handicap. While Ion Torrent was adept at creating buzz with founder Jonathan Rothberg's rock star presentations and their sequencer giveaway contests, the actual marketing infrastructure to follow up on all the leads generated through those efforts was clearly lacking (as in, they have yet to contact me!).
One interesting detail of the press release is the fact that the price point for their sequencer is placed at "below $100K"; Ion Torrent had previously billed their machine at under $50K. Is this a real shift, or does it simply reflect the true cost once sample prep gear is thrown in?
Now, there are several interesting angles to watch. First, how will Life position their full lineup of sequencers -- now that they have 3 different technologies (SOLiD, Ion Torrent & VisiGen) with very different performance characteristics. Plus, they had the SOLiD PI in line to be an entry level second generation sequencer -- how will this affect that?
Another area to watch is how tightly Ion Torrent is tied into the SOLiD line. While the chemistry is very different, there are opportunities. For example, can the EZ Bead emulsion PCR robots be used for Ion Torrent sample prep (with the whole sample prep issue being a big black box for the technology)? Will the same library prep reagents for SOLiD be usable with Ion Torrent? I'd love to see that -- especially if Ion Torrent drives volumes which ultimately drive kit costs down. Of course, the biggest question is when people can actually buy one of the beasts.
Roche/454 seemed like a more obvious partner for Ion Torrent -- very similar chemistries, and a tie-up of that sort might have meant a very rapid extension of Ion Torrent read lengths. Roche should be quite nervous; between Ion Torrent and Pacific Biosciences they are going to be under extreme pressure in long-read niches, and their next technology (GE's nanopores) is unlikely to be ready for many years. Ion Torrent could have also been an interesting play for a reagent company looking to jump into sequencing instruments. Such a company could have also brought the right sales network into play. A non-bio player could have happened, but I doubt that would have ended well -- Ion Torrent needs to complete their act & get their machine out to biologists.
Saturday, August 07, 2010
Perchance to dream
I had an amusing dream the other night. Nothing earth shattering: neither starved calves consuming fatted ones nor serpentine molecular orbitals. But, an amusing spin on something I had recently discussed with a friend.
In the dream, I've apparently gotten to a presentation late -- and just missed the announcement of the sample preparation upstream of the Ion Torrent instrument. I look to my side & it's a guy from Ion Torrent with all sorts of stuff in front of him, but when I try to ask him what I missed, he indicates silence. And then I wake up.
How exactly one goes from DNA to the instrument still appears to be a mystery. There is certainly nothing on the Ion Torrent website (which is rather focused on flash, not substance) to reveal it. A reasonable assumption is emulsion PCR, but there are other candidates (e.g. rolling circle).
Given that this is a rather important piece of the puzzle, there are several common guesses for why it is a mystery. One is that there are IP issues still to be resolved. In a similar vein, Ion Torrent just licensed some IP from a British company (DNA Electronics) which sounds like a near clone in terms of approach. A second is that they are still working out which approach to support. A third is that it isn't flashy enough to be worth mentioning.
Interestingly, Ion Torrent has apparently already sold a machine each to the USGS and NCI, plus there are the ones promised to the grant winners (of which, alas, I am not one -- though a winning proposal was a kissing cousin of mine). And Ion Torrent certainly hasn't started beating the bushes hard for sales.
I'm still very eager to try out the Ion Torrent box. While it won't replace some of the other systems for many applications, the cost profile of Ion Torrent will open up very high throughput sequencing to many more labs. I have a number of ideas of how I might use one rather frequently. Now if only they'd try to sell me one -- and ideally in the real world, and not while I slumber!
Sunday, August 01, 2010
Curse you Larry the CEO!
A bit after getting to my current shop, I requested some serious iron for my work and it was decided I would have a Linux box. The question came up as to which flavor, and after canvassing my networks we went with Ubuntu. I had never administered a Linux system before and had to learn the whole package installation procedure, which is so easy even I could learn it. The "apt" tool works beautifully 99% of the time, not only getting and installing the package of interest but also all its dependencies. The occasional exceptions were cases where either the package of interest didn't seem to be available from a package repository or the Ubuntu repositories were behind the version I needed. But in general, it was nice and painless.
Earlier this year, it was clear I needed an Oracle play space and the obvious place was my machine -- not only is it quite powerful, but then any blow-back from any misdeeds of mine would hit only the perpetrator. However, when our skilled Oracle expert contractor tried to install Oracle, not much luck -- Oracle apparently doesn't support Ubuntu well. So the decision was made to switch to Red Hat.
This did not go cleanly -- the admins were fighting with the reinstall most of the week (the RAID drive had protections on it that did not wish to go quietly) but finally the new system was configured on Friday. So on Saturday night, I declared, "Amanda, I know what we are going to do today! Install packages!".
Now, I've actually made a consistent habit here of logging all my installs, so I had a menu of what to try to install. Some quick Googling found some guides to using the different installation tools on Red Hat. So I started trying to install stuff. A few went cleanly, but those were definitely the exception -- and the worst part is that R is proving to be a major headache.
The problem is trying to get all the dependencies to install, and R has a heap of them. The fact that many have "-devel" in the title can't make things easy. Worse, one required package, "tetex-latex", is no longer supported by its creator. Despite configuring multiple repositories and trying to download some packages manually, I have made little headway so far. So from that standpoint, at the moment my system is "Busted!".
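The dependency tangle can be illustrated with a toy topological sort: every package must come after the packages it depends on, and a single orphaned node (like the abandoned "tetex-latex") blocks the whole chain. The graph below is invented for illustration, not Red Hat's actual package metadata.

```python
# Toy model of package dependency resolution: compute an install order
# that respects dependencies. Graph maps package -> set of dependencies.
# This fragment is made up; real RPM metadata is far messier.
from graphlib import TopologicalSorter  # Python 3.9+

deps = {
    "R": {"readline-devel", "tetex-latex", "libX11-devel"},
    "readline-devel": {"readline"},
    "tetex-latex": set(),   # the orphaned package -- no repo still carries it
    "libX11-devel": {"libX11"},
    "readline": set(),
    "libX11": set(),
}

# static_order() yields each package only after all its dependencies.
order = list(TopologicalSorter(deps).static_order())
assert order.index("readline") < order.index("readline-devel")
```

The point being: "apt" does this walk (and the downloads) for you; doing it by hand across half-configured repositories is exactly the misery described above.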
Now, I could blame our contractor, but how was he to know this would be so miserable (though the comment by someone at Red Hat support that this is the first time he'd heard of someone going from Ubuntu to Red Hat does give pause!)? I could also take umbrage with the Linux community, which seems to be a hydra of endless subvariants (Ubuntu, Debian, Red Hat, Red Hat Enterprise, CentOS, Fedora, Mandriva -- and I'm sure that's an incomplete list!). But, it's easiest to blame Oracle, who doesn't support Ubuntu, and if I'm going to do that I'll single out the face of Oracle. On the other hand, it's a bit pointless to hold anger over this against Mr. Ellison. He's a CEO; they don't do much.
Friday, July 30, 2010
A huge scan through cancer genomes
Genentech and Affymetrix just published a huge paper in Nature using a novel technology to scan 4Mb in 441 tumor genomes for mutations -- the largest number of tumor samples yet screened across so many genes. Dan Koboldt over at MassGenomics has given a nice overview of the paper, but there are some bits I'd like to fill in as well. I'll blame some of my sloth in getting this out on the fact that I was reading back through a chain of papers to really understand the core technique, but that's a weak excuse.
It's probably clear by now that I am a strong proponent (verging on cheerleader) for advanced sequencing technologies and their aggressive application, especially in cancer. The technology used here is intriguing, but it is in some ways a bit of a throwback. Now, thinking that (and then saying it aloud) forces me to ask why I say it -- perhaps this is a wave of the future, though I am skeptical -- but that doesn't detract from what they did here.
The technology, termed "mismatch repair detection", relies on some clever co-opting of the normal DNA repair mechanisms in E.coli. So clever is the co-opting, that the repair mechanisms are used to sometimes break a perfectly good gene!
The assay starts by designing PCR primers to generate roughly 200 bp amplicons. A reference library is generated from a normal genome and cloned into a special plasmid. This plasmid contains a functional copy of the Cre recombinase gene as well as the usual complement of gear in a cloning plasmid. This plasmid is grown in a host which does not Dam methylate its DNA, a modification in E.coli which marks old DNA to distinguish it from newly synthesized DNA.
The same primers are used to amplify target regions from the cancer genomes. These are cloned into a nearly identical vector, but with two significant differences. First, it has been propagated in a Dam+ E.coli strain; the plasmid will be fully methylated. Second, it also contains a Cre gene, but with a 5 nucleotide deletion which renders it inactive.
If you hybridize the test plasmids to the reference plasmids and then transform E.coli, one of two results occurs. If there are no point mismatches, then pretty much nothing happens and Cre is expressed from the reference strand. The E.coli host contains an engineered cassette conferring resistance to one antibiotic (Tet) but sensitivity to another (Str). With active Cre, this cassette is destroyed, switching the antibiotic resistance phenotype to Tet sensitivity and Str resistance.
However, the magic occurs if there is a single base mismatch. In this case, the methylated (test) strand is assumed to be the trustworthy one, and so the repair process eliminates the reference strand -- along with the functional allele of Cre. Without Cre activity, the cells remain resistant to Tet and sensitive to Str.
So, by splitting the transformation pool (all the amplicons from one sample transformed en masse) and selecting one half with Str and the other with Tet, plasmids are selected that either lack or carry a variant allele. Compare these two populations on a two-color resequencing array and you can identify the precise changes in the samples.
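The selection logic of the last three paragraphs boils down to a simple truth table, sketched here as a toy model (purely illustrative; it abstracts away all the actual biology of methylation-directed repair):

```python
# Simplified model of mismatch-repair-detection (MRD) selection.
# No mismatch: the reference strand (and its functional Cre) survives,
#   Cre destroys the Tet-R/Str-S cassette, and the clone becomes Str resistant.
# Mismatch: repair trusts the methylated test strand, destroys the reference
#   strand and its Cre, so the cassette survives and the clone stays Tet resistant.

def surviving_antibiotic(has_mismatch: bool) -> str:
    """Return the antibiotic on which a clone survives selection."""
    cre_active = not has_mismatch  # Cre lives only if the reference strand does
    return "Str" if cre_active else "Tet"

# Split the pool: Str selection recovers wild-type amplicons,
# Tet selection recovers variant-carrying amplicons.
assert surviving_antibiotic(has_mismatch=False) == "Str"
assert surviving_antibiotic(has_mismatch=True) == "Tet"
```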
A significant limitation of the system is that it is really sensitive only to single base mismatches; indels and rearrangements are not detectable. The authors wave indels away as "typically are a small proportion of somatic mutation", but of course they are a very critical type of mutation in cancer, as they frequently are a means to knock out tumor suppressors. For large scale deletions or amplifications they use a medium density (244K) array, amusingly from Agilent. Mutation scanning was performed in both tumor tissue and matched normal, enabling bioinformatic filtering of germline variants (though dbSNP was apparently used as an additional filter).
No cost estimates are given for the approach. Given the use of arrays, the floor can't be much below $500/sample or $1000/patient. The MRD system can probably be automated reasonably well but with a large investment in robots. Now, a comparable second generation approach (scanning about 4Mb) using any of the selection technologies would probably run $1000-$2000 per sample (2X that per patient), or perhaps 2-4X as much. So, if you were planning such an experiment you'd need to trade off your budget versus being blind to any sort of indels. The copy number arrays add expense but enable seeing big deletions and amplifications, though with sequencing the incremental cost of that information in a large study might be a few hundred dollars.
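For what it's worth, the back-of-envelope comparison works out like this. All figures are the rough guesses quoted above (array floor, hybrid-capture sequencing range), not anyone's actual quotes:

```python
# Back-of-envelope cost comparison, MRD (array-based) vs. second-gen
# sequencing for a ~4Mb scan. All dollar figures are rough assumptions.
mrd_floor_per_sample = 500            # array floor, $/sample
seq_low, seq_high = 1000, 2000        # targeted sequencing, $/sample

# Each patient needs tumor plus matched normal, i.e. two samples:
mrd_per_patient = 2 * mrd_floor_per_sample            # $1000
seq_per_patient = (2 * seq_low, 2 * seq_high)         # $2000-$4000

ratio_low = seq_per_patient[0] / mrd_per_patient      # 2x
ratio_high = seq_per_patient[1] / mrd_per_patient     # 4x
```

Hence the 2-4X gap claimed above -- real, but not so large that rapidly falling sequencing costs couldn't erase it.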
I think the main challenge to this approach is that it is off the beaten path. Sequencing-based methods are receiving so much investment that they will continue to narrow the price gap (whatever it currently is). Perhaps the array step will be replaced with a sequencing assay, but the system both relies on and is hindered by the repair system's blindness to small indels. Sensitivity for the assay is benchmarked at 1%, which is quite good. Alas, there is no discussion of amplicon failure rates or of regions of the genome which could not be accessed. Between high/low GC content and E.coli-unfriendly human sequences, there must have been some of this.
There is another expense which is not trivial. In order to scan the 4Mb of DNA, nearly 31K PCR amplicons were amplified out of each sample. This is a pretty herculean effort in itself. Alas, the Materials & Methods section is annoyingly (though not atypically) silent on the PCR approach. With the right automation, setting up that many PCRs is tedious but not undoable (though did they really run over 80 384-well plates per sample?). But conventional PCR quite often requires about 10ng of DNA per amplification, which naively implies roughly a third of a milligram of input DNA -- impossible without whole genome amplification, which is at best a necessary evil as it can introduce biases and errors. Second generation sequencing libraries can be built from perhaps 100ng-1ug of DNA, a significant advantage on this cost axis (though sometimes still a huge amount from a clinical tumor sample).
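Checking the naive input-DNA arithmetic: at ~10ng per reaction, 31K amplicons works out to about 0.31mg of DNA per sample, and a bit over 80 384-well plates (assuming one reaction per well; both figures are back-of-envelope, not from the paper):

```python
# Naive per-sample input-DNA and plate-count arithmetic for the MRD study.
amplicons = 31_000        # ~31K PCR amplicons per sample
ng_per_pcr = 10           # typical conventional-PCR input, ng (assumption)

total_ng = amplicons * ng_per_pcr      # 310,000 ng
total_mg = total_ng / 1_000_000        # ~0.31 mg of input DNA

plates_384 = amplicons / 384           # ~81 384-well plates per sample
```

Either figure alone makes the case for whole genome amplification (or serious multiplexing) somewhere in the workflow.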
Now, perhaps one of the microfluidic PCR systems could be used, but if the hybridization of tester and reference DNAs requires low complexity pools, a technique such as RainDance isn't in the cards. My friend who sells the 48 sample by 48 amplicon PCR arrays would be in heaven if they adopted that technology to run these studies.
One plus of the study is a rigorous sample selection process. In addition to requiring 50% tumor content, every sample was reclassified by a board-certified pathologist and immunohistochemistry was used to ensure correct differentiation of the three different lung tumor types in the study (non-small cell adenocarcinoma, non-small cell squamous, and small cell carcinoma). Other staining was used to subclassify breast tumors by common criteria (HER2, estrogen receptor and progesterone receptor) and the prostate tumors were typed by an RT-PCR assay for a common (70+% of these samples!) driver fusion protein (TMPRSS2-ERG).
Also, it should be noted that they experimentally demonstrated a generic oncogenic phenotype (anchorage independent growth) upon transformation with mutants discovered in the study. That they could scan for so much and test so few is not an indictment of the paper, but a sobering reminder of how fast mutation finding is advancing and how slowly our ability to experimentally test those findings follows.

Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VE, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, Faham M, & Seshagiri S (2010). Diverse somatic mutation patterns and pathway alterations in human cancers. Nature PMID: 20668451
Thursday, July 22, 2010
Salespeople, don't forget your props!
I had lunch today with a friend & former colleague who sells some cool genomics gadgets. One thing I've noted about him is whenever we meet he has a part of his system with him; it's striking how often this isn't the case (he's also been kind enough to leave me one).
Now, different systems have different sorts of gadgets with different levels of portability and attractiveness. The PacBio instrument is reputed to weigh in at one imperial ton, making it impractical for bringing along. Far too many folks are selling molecular biology reagents, which all come in the same sorts of Eppendorf tubes.
But on the other hand, there are plenty of cool parts that can be shown off. Flowcells for sequencers are amazing devices, which I've seen far too few times in the hands of salespersons. One of the Fluidigm microfluidic disposables is quite a conversation piece -- and the best illustration for how the technology works. The ABI 3730 sequencer's 96 capillary array was so striking I once took a picture of it -- or I thought I had until I looked through the camera files. The capillaries are coated with Kapton (or a similar polymer), giving them a dark amber appearance. They are delicate yet sturdy, bending over individually but in an organized fashion.
However, my favorite gadget memory is of an Illumina 96-pin bead array device. The beads are small enough that they produce all sorts of interesting optical effects -- 96 individually mounted opals!
Of course, those gadgets are not cheap. However, any well-run manufacturing process is going to have failures, which are a good source of display units. Yes, if you have a really good process you won't generate any defective products, but given the rough state of the field I'm a bit suspicious of a process whose downstream QC never finds a problem. In any case, even if you can't afford to give them away, at least your representatives at the major trade shows should carry one. If a picture is worth a thousand words, then an actual physical object is worth exponentially more in persuasive power.
One final thought. Given the rapidly changing nature of the business, many of these devices have very short lifetimes (in some cases because the company making them has a similarly short lifetime). I sincerely hope some museum is collecting examples of these, as they are important artifacts of today's technology. Plus, I really could imagine an art installation centered around some 3730 capillaries & Illumina bead arrays.
Wednesday, July 21, 2010
Distractions -- there's an app for that
Today I finally gave in to temptation & developed a Hello World application for my Droid. Okay, developed is a gross overstatement -- I successfully followed a recipe. But it took a while to install the SDK & its plugin for the Eclipse environment, plus the necessary device driver so I could debug stuff on my phone.
Since I purchased my Droid in November, the idea of writing something for it has periodically tempted me. Indeed, one attraction of Scala (which I've done little with for weeks) was that it can be used to write Android apps, though it definitely means a new layer of complexity. This week's caving in had two drivers.
First, Google last week announced AppInventor, a novice "you can write an app even if you can't program" tool. I rushed to try it out, only to find that they hadn't actually made it available, only a registration form. Supposedly they'll get back to you, but they haven't yet. Perhaps it's because I'm not an educator -- the form has lots of fields slanted toward educators.
The second trigger is that an Android book I had requested came in at the library. Now, it's for a few versions back of the OS -- but certainly okay for a start (trying to keep public library collections current on technical stuff is a quixotic task in my opinion, though I do enjoy the fruits of the effort). So that was my train reading this morning & it got me stoked. The book is certainly not much more than a starting springboard -- I'm debating buying one called "Advanced Android Programming" (or something close to that) or whether just to sponge off on-line resources.
The big question is what to do next. The general challenge is choosing between apps that don't do anything particularly sophisticated but are clearly doable vs. more interesting apps that might be a bit much to take on -- especially given the challenge of operating a simulator for a device very unlike my laptop (accelerometers! GPS!). I have a bunch of ideas for silly games or demos, most of which shouldn't be too hard -- and then one concept that could be somewhat cool but also really pushing the envelope on difficulty.
It would be nice to come up with something practical for my work, but right now I haven't many ideas in that area. Given that most of the datasets I work with now are enormous, it's hard to see any point to trying to access them via phone. A tiny browser for the UCSC genome database has some appeal, but that's sounding a bit ambitious.
If I were still back at Codon Devices, I could definitely see some app opportunities, either to demo "tech cred" or to be really useful. For example, at one point we were developing (through an outsourced vendor) a drag-and-drop gene design interface. The full version probably wouldn't be very app appropriate, but something along those lines could be envisioned -- call up any protein out of Entrez & have it codon optimized with appropriate constraints & sent to the quoting system. In our terminal phase, it would have been very handy to have a phone app to browse metabolic databases such as KEGG or BioCyc.
That thought has suggested what I would develop if I were back in school. There is a certain amount of simple rote memorization that is either demanded or turns out to expedite later studies. For example, I really do feel you need to memorize the single letter IUPAC codes for nucleotides and amino acids. I remember having to memorize amino acid structures and the Krebs cycle and glycolysis and all sorts of organic synthesis reactions and so forth. I often devised either decks of flash cards or study sheets, which I would look at while standing in line for the cafeteria or other bits of solitary time. Some of those decks were a bit sophisticated -- for the pathways I remember making both compound-centric and reaction-centric cards for the same pathways. That sort of flashcard app could be quite valuable -- and perhaps even profitable if you could get students to try it out. I can't quite see myself committing to such a business, even as a side-line, so I'm okay with suggesting it here.
Tuesday, July 13, 2010
There are 2 styles of Excel reports: Mine & Wrong
A key discovery which is made by many programmers, both inside and outside bioinformatics, is that Microsoft Excel is very useful as a general framework for reporting to users. Unfortunately, many developers don't get beyond that discovery to think about how to use this to best advantage. I've developed some pretty strong opinions on this, which have been repeatedly tested recently by various files I've been sent. I've also used this mechanism repeatedly, with some Codon reports for which I am guilty of excessive pride.
An overriding principle for me is that I am probably going to use any report in Excel as a starting point for further analysis, not an endpoint. I'm going to do further work in Excel or import it into Spotfire (my preference) or JMP or R or another fine tool. Unfortunately, there are a lot of practices which frustrate this.
First, as much data as possible should be packed into as few tabs as practical. Unless you have a very good reason, don't put data formatted the same way into multiple files or multiple tabs. I recently got some sequencing results from a vendor and there was one file per amplicon per sample. I want one file per total project!
Second, the column headers need to be ready for import. That means a single row of column headers, with a specific and unique header for every column. Yes, for viewing it sometimes looks better to have multiple rows and use cell fusing and other tricks to minimize repetition -- but for import this is a disaster guaranteed, or at least likely, to happen.
Third, every row needs to tell as complete a story as possible. Again, don't go fusing cells! It looks good, but nobody downstream can tell that the second row really repeats the first N cells of the row above (because they are fused).
Fourth, don't worry about extra rows. One tool I use for analysis of Sanger data spits out a single row per sample with N columns, one column for each mutation. This is not a good format! Similarly, think very carefully before packing a lot into a single cell -- Excel is terrible for parsing that back out. Don't be afraid to create lots of columns & rows -- Excel is much better at hiding, filtering or consolidating than it is at parsing or expanding.
Finally, color or font coding can be useful -- but use it carefully and generally redundantly. Ignoring the careful part means generating confusing "angry fruit salad" displays (and never EVER make text blink in a report or slide!!!).
Follow these simple rules and you can make reports which are springboards for further exploration. It's also a good start to thinking about using Excel as a simple front end to SQL databases.
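Most of these rules can even be checked mechanically before a report goes out the door. A minimal sketch in plain Python (no Excel library assumed, and the function name is mine) that validates a table is import-ready:

```python
# Minimal sketch: check that a table (a list of rows, first row = header)
# follows the "import-ready" rules above: one header row with unique,
# non-empty names, and every data row complete -- no cells left blank to
# mean "same as the row above", the way fused cells do.
def import_ready(table):
    header, *rows = table
    if len(set(header)) != len(header) or any(h in ("", None) for h in header):
        return False                      # duplicate or empty column names
    return all(len(r) == len(header) and all(c is not None for c in r)
               for r in rows)             # every row tells the whole story

good = [["sample", "amplicon", "mutation"],
        ["S1", "A1", "V600E"],
        ["S1", "A2", "none"]]
bad  = [["sample", "amplicon", "mutation"],
        [None, "A2", "none"]]             # relies on the row above

print(import_ready(good), import_ready(bad))
```

A check like this slots naturally into whatever script generates the report, so violations are caught by the producer rather than the poor downstream analyst.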
So what was so great about my Codon reports? Well, I had figured out how to generate the XML to handle a lot of nice features of the sort I've discussed above. The report had multiple tabs, each giving a different view or summary of the data. The top tab did break my rules -- it was a purely summary table & was not formatted for input into other tools (though now I'm feeling guilty about that; perhaps I should have had another tab with it properly formatted). But each additional tab stuck to the rules. All of them had AutoFilter already turned on and had carefully chosen highlighting when useful -- using a combination of cell color and text highlighting to emphasize key cells. Furthermore, it also hewed to my absolute dictum "Sequences must always be in a fixed width font!". I didn't have it automatically generate Pivot Tables; perhaps eventually I would have gotten there.
Monday, June 28, 2010
Knome Cofactors Ozzy Osbourne's Genome
I got an email today pointing out that Cofactor Genomics will be responsible for generating Ozzy Osbourne's genome sequence for Knome. Said email, unsurprisingly, originated at Cofactor.
Now, before anyone accuses me of being a shill let me point out that (a) I get no compensation from any of these companies and (b) I've asked for multiple quotes from Cofactor in good faith but have yet to actually send them any business. Having once been in the role of quote-generator and knowing how frustrating it is, I have a certain sympathy & see spotlighting them as a certain degree of reasonable compensation.
Knome's publicity around the Osbourne project has highlighted his body's inexplicable ability to keep functioning despite the tremendous chemical abuse Mr. Osbourne has inflicted on it. The claim is that his genome will shed light on this question. Given that there are no controls in the experiment and he is an N of one, I doubt anything particularly valuable scientifically will come of this. I'm sure there will be a bunch of interesting polymorphisms which can be cherry-picked -- we all carry them. Plus, the idea that this particular individual will change his life in response to some genomics finding is downright comical -- clearly this is not someone who thinks before he leaps! It's about as likely as finding the secret to his survival in bat head extract.
Still, it is a brilliant bit of publicity for Knome. Knome could use the press given that the FDA is breathing down their necks along with the rest of the Direct to Consumer crowd. Getting some celebrities to spring for genomes will set them up for other business as their price keeps dropping. The masses buy the clothes, cars and jewelry worn by the glitterati, so why not the genome analysis? Illumina ran Glenn Close's genome, but Ozzy probably (sad to say) has a broader appeal across age groups.
Sunday, June 27, 2010
Filling a gap in the previous post
After thinking about my previous entry on PacBio's sample prep and variant sensitivity paper, I realized there was a significant gap -- neither the paper nor my post dealt with gaps.
Gaps, or indels, are very important. Not only are indels a significant form of polymorphism in genomes, but small indels are one route in which tumor suppressor genes can be knocked out in cancer.
Small indels have also tended to be troublesome in short read projects -- one cancer genome project missed a known indel until the data was reviewed manually. Unfortunately, no more details were given of this issue, such as the depth of coverage which was obtained and whether the problem lay in the alignment phase or in recognizing the indel. The Helicos platform (perhaps currently looking like it should have been named Icarus) had a significant false indel rate due to missed nucleotide incorporations ("dark bases"). Even SOLiD, whose two-base encoding should in theory enable highly accurate base calling, had only a 25% rate of indel verification in one study (all of these are covered in my review article).
Since PacBio, like Helicos, is a single molecule system it is expected that missed incorporations will be a problem. Some of these will perhaps be due to limits in their optical scheme but probably many will be due to contamination of their nucleotide mixes with unlabeled nucleotides. This isn't some knock on PacBio -- it's just that any unlabeled nucleotide contamination (either due to failure to label or loss of the label later) will be trouble -- even if they achieve 1 part in a million purity that still means a given run will have tens or hundreds of incorporations of unlabeled nucleotides.
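The arithmetic behind that last claim is simple enough to sketch (the bases-per-run figure is an assumed round number, not a PacBio spec):

```python
# Even tiny unlabeled-nucleotide contamination produces silent ("dark")
# incorporations -- events the optics can never see.
def dark_bases(bases_per_run, contamination_fraction):
    return bases_per_run * contamination_fraction

# 1e8 incorporations per run at 1 part per million -> ~100 dark bases
print(dark_bases(100_000_000, 1e-6))
```

So even heroic reagent purity only bounds the problem; it doesn't eliminate it.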
There's another interesting twist to indels -- how do you assign a quality score to them? After all, a typical phred score is the probability that the called base is really wrong. But what if no base should have been called? Or another base called prior to this one? For large numbers of reads, you can calculate some sort of probability of seeing the same event by chance. But for small numbers, can you somehow sort the true from the false? One stab I (and I think others) have made at this is to give an indel a pseudo-phred score derived from the phred scores of the flanking bases -- the logic being that if those were strong calls then there probably isn't a missing base but if those calls were weak then your confidence in not skipping a base is poor. The function to use on those adjacent bases is a matter of taste & guesswork (at least for me) -- I've tried averaging the two or taking the minimum (or even computing over longer windows), but have never benchmarked things properly.
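Here's a sketch of that pseudo-phred idea, with the combining function left as a parameter since (as I said) the right choice is guesswork; the names are mine, not from any published tool:

```python
# Score a candidate indel by the phred qualities of the flanking base
# calls: strong flanking calls argue against a skipped base, while weak
# ones leave the door open.
def indel_pseudo_phred(quals, left, right, mode="min"):
    """quals: per-base phred scores; left/right: indices flanking the gap."""
    flank = (quals[left], quals[right])
    return min(flank) if mode == "min" else sum(flank) / 2.0

quals = [30, 35, 12, 40, 38]                    # weak call at index 2
print(indel_pseudo_phred(quals, 1, 2))          # min(35, 12) -> 12
print(indel_pseudo_phred(quals, 1, 2, "mean"))  # (35 + 12) / 2 -> 23.5
```

Extending the window beyond the two immediately flanking bases is the obvious variation, and as noted I've never benchmarked any of these choices properly.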
Some variant detection algorithms for short read data deal with indels (e.g. VarScan) and some don't (e.g. SNVMix). Ideally they all would. There's also a nice paper on the subject that unfortunately didn't leave lazy folks like me their actual code.
So, in conclusion, I'd love to see PacBio (and anyone else introducing a new platform) perform a similar analysis of their circular consensus sequencing on a small amount of a single-nucleotide indel variant spiked into varying (but large) backgrounds of wildtype sequence. Indeed, ideally there would be a standard set of challenges which the scientific community would insist that every new sequencing platform either publish results on or confess inadequacy for the task. As I suggested in the previous post, extremes of %GC, hairpins, and mono/di/tri/tetranucleotide repeats should be included, along with the single nucleotide substitution and single nucleotide deletion mixtures. To me these would be far more valuable in the near term than the sorts of criteria in the sequencing X-prize for exquisite accuracy. Those are some amazing specs (and I am on record as skeptical they will be met anytime soon). What those of us trying to match platforms with experiments (and budgets) really need are the nitty-gritty details of what works and what doesn't, at what cost (or coverage).
Wednesday, June 23, 2010
PacBio oiBcaP PacBio oiBcaP
PacBio has a paper in Nucleic Acids Research giving a few more details on sample prep and some limited sequencing quality data for their platform.
The gist (which was largely in the previous Science paper) is that the templates are prepared by ligating a hairpin structure on to each end. In this paper they focused on a PCR product, with the primers designed with type IIS (BsaI) restriction sites flanking the insert. Digestion yields sticky ends (and with BsaI, these can be designed to be non-symmetric to discourage concatemerization) which enable ligating the adapters. Of course, for applying this on a large scale you do run into the problem of avoiding the restriction sites. There are other approaches, such as uracil-laden primer segments which can be specifically destroyed by the DUT and UNG enzymes -- the only catch being that many of the most accurate polymerases are poisoned by uracil. A couple of other schemes have either been proven or can be sketched quickly.
In any case, once these molecules are formed they have the nice property of being topologically circular and having the insert sequence represented twice (once in reverse complement) within that circle, with the two pieces separated by the known linker sequence. So if the polymerase lasts to go round-and-around, then each pass gives another crack at the sequence. For short templates, this gives lots of rounds and even on a longer (1 kb) PCR product they successfully had three bites at the apple. Feed these into a probabilistic aligner and each pass improves the quality values. Running the same template many times enabled calibrating the quality values (which are Phred-scaled), showing them to be a reasonable estimator of the likelihood of miscalling the consensus.
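The way multiple passes buy quality can be sketched on the Phred scale: if passes were truly independent, their error probabilities would multiply, so (ignoring the choice among wrong bases and any correlated errors, which the real probabilistic aligner must handle) the Phred values of concordant passes would simply add. A toy version:

```python
def phred_to_perr(q):
    """Phred-scaled quality to error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def naive_consensus_q(pass_quals):
    """Consensus quality for concordant, fully independent passes:
    independent error probabilities multiply, so Phred values add.
    Real data falls short of this because pass errors correlate."""
    return sum(pass_quals)

# Three Q15 passes would naively give Q45 -- rather more optimistic
# than the ~Q40 asymptote apparent in the paper's plots.
print(naive_consensus_q([15, 15, 15]))  # 45
phred_to_perr(40)  # ~1e-4, the apparent ceiling
```

The gap between the naive sum and the observed ceiling is itself informative: it suggests some error modes recur on every pass of the same molecule.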
Given that eventually their polymerase dies on a template, they point out that the length of a fragment is highly predictive of the number of passes which can be observed, which in turn implies the sequencing quality which will be obtained for a template. This makes for an interesting variation on the usual "what size library" question in sequencing. For sequencing a bacterial genome, it may make sense to go very deep with short fragments to get high-quality pieces, which can then be scaffolded using much longer individual reads (without the circular consensus effect) as well as the spaced "strobe" reads to provide very long-range scaffolding. Given the claims made of capacity per SMRT cell, it would seem that by making two libraries (one short for consensus sequencing, one long for the others) and burning perhaps 4-5 $100 SMRT cells, one could get a very good bacterial genome draft. For targeted sequencing by PCR, the approach is quite attractive. Amplicon length becomes an important variable for final quality. Appropriate matching of the upstream multi-PCR technology (such as RainDance or Fluidigm or just lots of conventional PCRs) with PacBio for capacity will be necessary -- and hard numbers on throughput per SMRT cell are desperately needed!
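That length/quality tradeoff is easy to model roughly. In the sketch below, the adapter length and the total distance a polymerase survives are purely illustrative guesses, not PacBio specifications:

```python
def expected_passes(insert_len, adapter_len=50, poly_read_len=7000):
    """Roughly how many full laps a polymerase manages around a
    hairpin-capped circular template before dying.  One lap covers
    both strands of the insert plus the two hairpin adapters.  Both
    default lengths are illustrative assumptions."""
    lap = 2 * (insert_len + adapter_len)
    return poly_read_len // lap

# Shorter inserts trade read length for passes (and hence quality):
print(expected_passes(250))    # 11 passes
print(expected_passes(1000))   # 3 passes -- comparable to the paper's 1 kb product
```

Under these assumptions the "what size library" question becomes a direct quality dial, which is exactly the argument for a two-library bacterial genome strategy.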
One side thought: is there a risk with strobe sequencing of accidentally going off the end and reading back along the same insert -- but not realizing it since the flip turn in the hairpin was done in the dark? It would suggest that libraries for strobe sequencing need to be very stringently sized.
One curiosity of their plot of this phenomenon is that the values appear to be asymptotically approaching phred 40 -- an error rate of 1 in 10,000. Is this really where things top out? That's a good quality -- but for some applications possibly not good enough.
They go on to apply this to measuring the allele frequency of a SNP in defined mixtures. Given several thousand reads and the circular consensus, they succeeded at this -- even when the minor allele was present at a frequency of 2.5%. This isn't as far as some groups have pushed 2nd-generation sequencing for rare allele detection (such as finding a cancer mutation in a background of normal DNA), but is certainly in the range of many schemes used in this context (such as real-time PCR). If my inference of a maximum phred score is correct, it would tend to limit the sensitivity for rare mutation detection to somewhere in the parts-per-thousand range; the record is around this level (0.1%, reported by Yauch et al.).
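On the estimation side of that experiment, counting consensus reads is just binomial sampling, so the precision attainable at a given depth follows directly. A sketch with made-up counts (a simple Wald interval, not the analysis actually used in the paper):

```python
from math import sqrt

def allele_freq_ci(minor_reads, total_reads, z=1.96):
    """Point estimate and ~95% Wald confidence interval for a minor
    allele frequency estimated from consensus read counts."""
    f = minor_reads / total_reads
    se = sqrt(f * (1 - f) / total_reads)
    return f, max(0.0, f - z * se), f + z * se

# 125 minor-allele reads out of 5,000 -> 2.5% with roughly +/-0.4%:
f, lo, hi = allele_freq_ci(125, 5000)
print(f"{f:.3f} ({lo:.3f}-{hi:.3f})")  # 0.025 (0.021-0.029)
```

Note how the interval width shrinks only with the square root of read count; halving the uncertainty takes four times the reads, which is where per-SMRT-cell throughput numbers would really matter.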
My major concern with this paper is the lack of a breadth of substrates. Are there template properties which give trouble to the system? The obvious candidates are extremes of composition, secondary structures and simple repeats. Will their polymerase power through 80+% GC? Complex hairpins? Will stuttering be observed on long simple repeats? Can long mononucleotide runs be measured accurately?
In the end, the key uses for the first release PacBio system will depend on where these problems are and those throughput numbers. This paper is a useful bit of information, but much more is needed to determine when PacBio is either cost-effective for an application (vs. other sequencing or non-sequencing competitors) or when the performance advantages in that application (such as turn-around time) push them to the fore. Personally, I think it is a mistake on PacBio's part not to be literally flooding the world with data. Or at least the bioinformatics world -- only when they start releasing data to the broad developer community will there be the critical tweaking of existing tools and development of new tools to really take advantage of this platform and also cope with whatever weaknesses it possesses.

Travers, K., Chin, C., Rank, D., Eid, J., & Turner, S. (2010). A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research. DOI: 10.1093/nar/gkq543
Monday, June 07, 2010
What's Gnu in Sequencing?
The latest company to make a splashy debut is GnuBio, a startup out of Harvard which gave a presentation at the recent Personal Genomics conference here in Boston. Today's Globe Business section had a piece, and Bio-IT World & Technology Review covered it as well. Last week Mass High Tech & In Sequence (subscription required) each had a bit too.
GnuBio has some grand plans, which are really in two areas. For me the more interesting one is the instrument. The claim which they are making, with an attestation of plausibility from George Church (who is on their SAB, as is the case with about half of the sequencing instrument companies), is that a 30X human genome will be $30 in reagents on a $50K machine (library construction costs omitted, as is unfortunately routine in this business). The key technology from what I've heard is the microfluidic manipulation of sequencing reactions in picoliter droplets. This is similar to RainDance, which has commercialized technology out of the same group. The description I heard from someone who attended the conference is that GnuBio is planning to perform cyclic sequencing-by-synthesis within the droplets; this will allow minuscule reagent consumption and therefore low costs.
It's audacious & if they really can change out reactions within the picoliter droplets, technically it is quite a feat. From my imagination springs a vision of droplets running a racetrack, alternately getting reagents and being optically scanned for which base came next & an optical barcode on each droplet. I haven't seen this description, but I think it fits within what I have heard.
Atop those claims comes another one: despite having not yet read a base with the system, by year end two partners will have beta systems. It would be amazing to get proof-of-concept sequencing by then, let alone have an instrument shippable to a beta customer (this also assumes serious funding, which apparently they haven't yet found). Furthermore, it would be stunning to get reads long enough to do any useful human genome sequencing even after the machine starts making reads, let alone enough for 30X coverage.
The Technology Review article is depressingly full of sloppy journalism and failure to understand the topic -- depressing because that is a magazine I once read regularly and had significant respect for. One paragraph has two doozies:

Because the droplets are so small, they require much smaller volumes of the chemicals used in the sequencing reaction than do current technologies. These reagents comprise the major cost of sequencing, and most estimates of the cost to sequence a human genome with a particular technology are calculated using the cost of the chemicals. Based solely on reagents, Weitz estimates that they will be able to sequence a human genome 30 times for $30. (Because sequencing is prone to errors, scientists must sequence a number of times to generate an accurate read.)

The first problem here is that yes, the reagents are currently the dominant cost. But if library construction costs are somewhere in the $200-500 range, then once you drop reagents greatly below that level it's a bit dishonest to tout (and poor journalism to repeat) a $30/human genome figure. Now, perhaps they have a library prep trick up their sleeve, or perhaps they can somehow go with a Helicos-style "look Ma, no library construction" scheme. Since they have apparently not settled on a chemistry (which will almost certainly impose technology licensing costs -- or mean developing a brand new chemistry -- or getting the Polonator chemistry, which is touted as license-free), anything is possible -- but I'd generally bet this will be a clonal sequencing scheme requiring in-droplet PCR. The second whopper there is the claim that the 30X coverage is needed for error detection. It certainly doesn't hurt, but even with perfect reads you still need to oversample just to have good odds of seeing both alleles in a diploid genome.
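That last point is just sampling arithmetic: with error-free reads and unbiased 50/50 sampling of a heterozygous site, the chance of seeing both alleles at least once is 1 - 2(1/2)^depth. A quick check:

```python
def prob_see_both(depth):
    """P(both alleles of a het site appear at least once among `depth`
    error-free reads, assuming unbiased 50/50 sampling).  Missing
    allele A has probability (1/2)^depth, likewise allele B, and the
    two events cannot co-occur for depth >= 1."""
    return 1 - 2 * 0.5 ** depth

for d in (2, 4, 8, 30):
    print(d, prob_see_both(d))
# 4x coverage still misses one allele of a het 12.5% of the time;
# by 30x the probability of seeing both is essentially 1.
```

Real coverage is also uneven across the genome, so the effective depth at many sites is well below the average -- another reason oversampling is unavoidable even with perfect chemistry.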
Just a little later in the story is the claim "The current cost to sequence a human genome is just a few thousand dollars, though companies that perform the service charge $20,000 to $48,000", which confuses what one company (Complete Genomics) may have achieved with what all companies can achieve.
The other half of the business plan I find even less appealing. They are planning to offer anyone a deal: pay your own way or let us do it, but if we do it we get full use of the data after some time period. The thought is that by building a huge database of annotated sequence samples, a business around biomarker discovery can be built. This business plan has of course been tried multiple times (Incyte, GeneLogic, etc.) and hasn't really worked in the past.
Personally, I think whoever is buying into this plan is deluding themselves in a huge way. First, while some of the articles seem confident this scheme won't violate the consent agreements on samples, it's a huge step from letting one institution work with a sample to letting a huge consortium get full access to potentially re-identifiable data. Second, without good annotation the sequence is utterly worthless for biomarker discovery; even with great annotation, randomly collected data is going to be challenging to convert into something useful. Plus, any large-scale distribution of such data will butt up against the widely accepted provision that subjects (or their heirs) can withdraw consent at any time.
The dream gets (in my opinion) just daffier beyond that -- subjects will be able to be in a social network which will notify them when their samples are used for studies. Yes, that might be something that will appeal to a few donors, but will it really push someone from not donating to donating? It's going to be expensive to set up & potentially a privacy leakage mechanism. In any case, it's very hard to see how that is going to bring in more cash.
My personal advice to the company is several-fold. First, ditch all those crazy plans around forming a biomarker discovery effort; focus on building a good technology (and probably selling it to an established player). Second, focus on RNA-Seq as your initial application -- this is far less demanding in terms of read length & will allow you to start selling instruments (or at least generating data) much sooner, giving you credibility. Of course, without some huge drops, the cost of library construction will dwarf that $30 in reagents, perhaps by a factor of 10X. A clever solution there using the same picodroplet technology will be needed to really get the cost of a genome to low levels -- and could be cross-sold to the other platforms (and again, perhaps a source of a revenue stream while you work out the bugs in the sequencing scheme).
Finally, if you really can do an RNA-Seq run for $30 a run in total operating costs, could you drop an instrument by my shop?
Saturday, June 05, 2010
Trying to Kick the Bullet
I know the title looks like a malapropism, but it isn't. What I'm trying to do is wean myself away from bulleted lists in PowerPoint.
I have a complex relationship with PowerPoint, as with the other tools in Microsoft Office. Each has real value but has also been seriously junked up by the wizards of Redmond. Far too much time is spent investing the tool with features that are rarely or never of value.
Edward Tufte, whom I admire greatly, takes a far more negative view of PowerPoint. I'm a fan of Tufte's, not an acolyte: PowerPoint can be very useful, if you use it carefully. But I'm always open to considering how I might improve how I use it, and after looking through some of Tufte's specific criticisms of bulleted lists, I realized here was an opportunity to make a change.
Now, I am a heavy user of bulleted lists. I often think in hierarchies & outlines, which fits them well. I also find it challenging to draw complex diagrams in a manner which is both presentable & useful, at least in reasonable time. So I often write many slides of bulleted lists & then try to go back and decorate them with relevant & informative diagrams I can lift from various other sources (or from previous slides). I do spend some time designing a few careful diagrams using the tools in PowerPoint.
So what is wrong with bulleted lists? As Tufte points out, the standard Microsoft scheme uses four or more different attributes to display levels of hierarchy. First, inferior levels are more indented than their parents. Second, the type size is changed. Third, the type face is changed to a different family or italicized (or both). Fourth, the character used for the bullet is changed. On top of Tufte's sharp criticism, there is the pointed satire as found in The Gettysburg PowerPoint Address.
Now, I find some of this useful, but it is a good wakeup call that some of it I have just been accepting. After all, do I really need the actual bullets? Rarely are they actually showing anything useful -- the one exception being when I change them around within one list to show a useful attribute (such as checks vs. X's). Another variant is using a numbered list to emphasize a critical order of points, such as a series of steps which must be executed in order. But most bullets are just consuming valuable slide real estate without adding value.
I also find the indents useful to show hierarchy. And having a smaller typeface is useful since the upper levels are more on the order of headlines and the lower levels often details -- so a smaller face allows me to pack more in.
But the change in font family or italicization? Those aren't very useful either. I'd much rather save italics for emphasis.
The challenge is actually putting this into practice. I have started going through active slide decks and converting them to the reduced list scheme. I don't see myself giving up hierarchical lists, but I'll try to do better within that structure.