Monday, January 31, 2011

Three Visions of The Next Phase of Sequencing

When thinking about the current crop of sequencing instruments, it occurred to me that they can be divided roughly into three categories, representing three different visions of what sequencing will look like for the next few years.

In the first category you have the monster instruments. I will use as my exemplar Pacific Biosciences, though this category would also include Helicos and to some degree the mainline sequencers of the other families. These instruments are large and require their own floor space (or effectively so). Think PDP-11 as an analogy.

In the second category are the "personal sequencers". The archetype is the Ion Torrent Personal Genome Machine (PGM), but with similar form factors are the 454 Jr and MiSeq. These instruments sit atop a bit of bench space. The PGM in particular is meant to be seen: a stylish logo composed of four symbols and even an iPod dock (what next? Personalized "skins" for your PGM?). Think personal computer as an analogy (PGM is a Mac & MiSeq a PC?).

Luke Jostins has a very nice article looking at what Oxford Nanopore has released about their instrumentation. Now, the big catch is that any working chemistry to go with that instrument is still tightly under wraps (and someone from OxNano smothered speculation that any big announcement will come at Marco Island this week). What is striking about their instrument is that while it could theoretically sit on a benchtop, what they are really aiming for are huge gangs of them -- the boxes fit in a standard computer rack. Talk about a data center!

This isn't just a mounting decision; OxNano is thinking in terms of the unique properties nanopores would bring. For example, there wouldn't be set cycles -- so each unit can read from a sample until some specified condition is met (read depth, found a given mutation, etc). The units are also designed to talk to each other, enabling large jobs to truly be spread over multiple units (SeqHadoop? The OxNano Collective -- your sequences will be assimilated?).

Fundamentally, this is a very different vision of the future of sequencing than the PGM's. Very few biological instruments I've seen are designed to be packed en masse efficiently; Codon had stacks and stacks of off-the-shelf thermocyclers but their loading designs insisted on racks that wasted lots of space. Such packing implies a large, centralized sequencing center. Each unit can take 96 samples, much as the PacBio can (and presumably OxNano will conveniently take 96-well plates, though in that form factor it isn't obvious how). Contrast this with the PGM, which requires each chip to be manually locked into the instrument. Perhaps PGM will later get a Big Brother, but for now it is clearly aimed at sequencing as a cottage industry.

Which gets back to an ongoing debate in science -- should science be done on a huge scale or by small teams? Should sequencers be present on every bench like a microcentrifuge or concentrated in big centers like a supercollider (or even like the autoclaves in an academic complex)? If sequencing is to become a routine and integral part of healthcare, will it be performed in doctors' offices or central labs? These will all be ongoing questions.

Of course, none of these reach the form factor we all dream of -- something the size & convenience of a tricorder.

Thursday, January 27, 2011

AGBT Around the Corner

In less than a week, the AGBT meeting (perhaps better known as Marco Island) begins. Alas, due to my usual dithering around going to meetings I'm not registered -- and I've been told it sold out the day registration opened. Barring some passes showing up on StubHub, I'll be stuck monitoring Twitter again.

AGBT can be a locale for making big announcements -- last year the meeting finished with Jonathan Rothberg unveiling (well, sort of) the Ion Torrent PGM and the year before PacBio was the star. But, I'm not expecting a lot of big announcements from the major players. The recent J.P. Morgan conference was the scene of Illumina's big MiSeq announcement and Ion Torrent's launch of their second generation chip. It's hard to see either of those groups making further announcements. Nor does either have a big talk to launch things, though I'm sure many vendors will be making announcements even without a podium spot. There is one talk featuring Ion Torrent from one of their early access sites, and I'm sure that will be popular for trying to get some ideas as to what really working with the instrument is like.

As far as the other established players, Roche/454 desperately needs to generate some excitement with some advancement; the long-heralded 1Kb reads might do. PacBio has two talks, one on the cholera work and one on detecting modified bases. Perhaps some more details on performance will leak out, but neither sounds promising. Complete Genomics has a workshop but no apparent talk.

The biggest opportunity for metaphorical fireworks (to go with the ones on the beach one night -- I love fireworks; absolutely must register next year!) would be if one of the nanopore or other blue sky technologies actually demonstrated real data. I'd say any read 10 bases or longer would be something, and demonstrating a set of reads in the 20+ basepair range would mean they are really on to something. There's a handful of talks scheduled for such technologies (Oxford Nanopore's founder, Nabsys, UC Santa Cruz on nanopores, BioNanomatrix; apologies if I missed someone). No talk from GnuBio; would be nice to see whether they are at all on their aggressive timeline.

I will be watching Twitter, so if you are there and can tweet it will be appreciated. I did try to follow ASHG in November on Twitter and got a bit frustrated with the standard web interface. So, today I whipped up a quick Perl harness to search for tweets and process them into a table. Most of the heavy lifting, of course, was done by a CPAN library & so I can actually put the code below in case anyone else finds it useful. Once the meeting starts, I might hack some more on it. My dream app would know the conference schedule and assign each tweet to a talk, but that's way beyond what I see myself getting together. Perhaps I'll actually go so far as to generate a SQLite database from the tweets of interest, but most likely it will just feed Excel.

Currently what it does is pull things like hashtags and URLs out of the text & put them in separate columns. The first column is the timestamp and the second is the author. Also, it skips over explicit re-tweets. One issue I may try to deal with is posts which are really re-tweets but not tagged correctly; if those are a nuisance I may try to work on some auto text-clustering. And perhaps I won't resist at least a little crude assignment of tweets to talks (looking for keywords; tracking a given user).

use strict;
use warnings;
use Net::Twitter::Lite;

# Credentials come from the environment
my $nt = Net::Twitter::Lite->new(
    username => $ENV{'TWITTER_USERNAME'},
    password => $ENV{'TWITTER_PASSWORD'}
);

# Search term from the command line, e.g. "AGBT" or "Ion Torrent"
my $r = $nt->search($ARGV[0]);

foreach my $status (@{$r->{'results'}}) {
    my $origText = $status->{'text'};
    next if ($origText =~ /^RT/);    # skip explicit re-tweets
    my $text = $origText;

    # Pull the first URL out into its own column
    my $url = "";
    if ($text =~ m/(http:[^ ]+)/) {
        $url = $1;
        $text =~ s/ ?\Q$url\E//;
    }

    # Collect hashtags, then strip them from the text
    my @hashes = ();
    while ($text =~ m/\#([^ ]+)/g) {
        push(@hashes, $1);
    }
    $text =~ s/ ?\#[^ ]+//g;
    $text =~ s/ +$//;

    # Tab-delimited: timestamp, author, cleaned text, URL, hashtags
    print join("\t", $status->{'created_at'}, $status->{'from_user'},
               $text, $url, join(",", @hashes)), "\n";
}

Monday, January 24, 2011

Yet Another Sequencing Service Provider

GenomeWeb and Matthew Herper both covered today's announcement by Perkin Elmer that they are launching a very sophisticated second generation sequencing service. Perkin Elmer becomes a rare large, publicly traded corporation in the market, though Beckman Coulter and Illumina are significant players as well (and in their specialized manner, Complete Genomics too).

Around a year ago I felt I had spoken to every U.S. based commercial provider of second generation sequencing, as well as pretty much every Boston or Cambridge core lab offering services and a few others (such as my alma mater). I've long since given up trying to maintain that status. Barriers to entry in the field seem to be minuscule, leading to new entrants on a regular basis. Many can be found through the dedicated thread at SeqAnswers or the Second Generation Sequencer Map, but just Googling for "sequencing service provider" tends to get a few new hits. Spread your scope to Europe or the rest of the world and the number of providers climbs further.

Given such a crowded field, what I would recommend to providers is they start thinking about differentiating themselves. How? Well, look for what the other providers don't offer.

For example, nobody is yet offering a low-minimum cost rapid turnaround service. Now, partly this is because platforms to support such have only recently become available. However, the one platform that is somewhat amenable to this (454 Jr) has only one service provider I've heard of, and they are offering 4 week turnaround times -- not what I call fast. For fast, I'm thinking more on the order of a week or less. On cost, there really need to be some offerings in the sub $5K range. Yes, you can buy individual lanes from some labs (particularly core facilities) for just over $1K each, but you'll need to wait until someone else fills a plate. There is a real need for some offerings using Ion Torrent, PacBio and/or MiSeq in this sort of role.

Another potential area of differentiation is sample prep. Most service providers I've talked to still want 5-10 micrograms of input DNA for their sequencing. Want to stand out? Start specializing in 50-100 nanogram input amounts. Or formalin-fixed paraffin-embedded samples, which most clinical folks need to deal with. There aren't many providers using the fancier PCR-based targeting schemes (Fluidigm and RainDance), nor do many that I know of offer automated size selection.

It is also interesting to see that new providers tend to be either purely Illumina or offering Illumina plus some other platforms. Indeed, some of the providers who previously were strong SOLiD or 454 shops seem to all be getting into the Illumina side. Another sign that Illumina is leading the pack; let's hope that continues to be good for the industry.

If you want to hear more of my thoughts on this matter, I'll be giving a talk on this topic during the last day of Bio-IT World (indeed, in the penultimate slot of the Next Generation Sequencing Informatics track).

Sunday, January 23, 2011

Time to Recognize A New Norm in Scientific Review!

The 20 January issue of Nature has a news article "Trial by Twitter" exploring the issue of how scientists should deal with comments on their papers on the Internet. The article covers a lot of ground but I'd like to deal with a little of it. Also, it is a news article and tries hard to be neutral; I will not be as neutral.

Part of the article touches on two life sciences papers from last year which attracted significant controversy: a genome-wide association study (GWAS) on human longevity and the claim of arsenic-incorporating bacteria. In both cases, the authors initially declined to respond to pointed critiques in blogs. I won't mince words where I stand on this: claiming it is "premature for us to talk about our experience because this is still an ongoing issue" (the GWAS authors) or "any discourse will have to be peer reviewed in the same manner as our paper was" (the arsenic authors) is stonewalling against the interest of science and should be unacceptable in the community of scientists. It is particularly ill-advised to make such a statement after, as the arsenic bacterium authors did, explicitly courting the popular press to publicize the work as significant and well-executed.

That said, one cannot expect scientists to know of every tweet or reference to their paper. Nor should scientists be obligated to address comments which have no basis in science, or to engage with persons making personal attacks. Obviously that leaves a lot of discretion -- but discretion is different from saying "all roads go through further formal peer review". It is also unacceptable to hide behind the fact that the original paper passed peer review; as a number of papers have demonstrated, peer review often fails for various reasons. The Reactome paper is a particularly painful example of this for me, since I swallowed it whole and later realized that I (and apparently the chosen reviewers) lacked the expertise to spot glaring issues in the paper. But, I do believe that scientists have an obligation to respond to detailed, reasoned, scientific critiques of their work.

Who should decide which critiques to respond to? The article talks a bit about post-review scientific scoring systems such as the Faculty of 1000. These are useful, but not necessarily the best way to find the most energetic and informed reviewers for a paper. Among those bloggers who comment on a paper, you are likely to find such individuals, though not with a perfect signal-to-noise ratio. Given that traditional journals are constantly struggling to find ways to stay relevant in an Internet world, to me the editors of the journal a paper was published in are the obvious candidates for selecting those critiques demanding response. Of course, the authors of the papers themselves should also play this role.

The article discusses how poor the response has been to commenting facilities at journal websites which enable such commentary. Now, my one interaction with such a site did lead to a peer-reviewed version of my blog post. But, that is the only time I've actually tried to contact the editors of a paper, and only because I felt so strongly. Editors and authors need to be more proactive, with the expected norm being that they actively look for intelligent commentary on papers they edited or authored. Resources such as Research Blogging can help find these; indeed, I would argue that any editor who doesn't scan Research Blogging for coverage of papers in their journal should immediately start doing so.

The point of scientific publishing and peer review is not to protect reputations and not to promote orthodoxy; the point of these is to attempt to ensure that good science is made better and bad science is swept clean as soon as possible. The existing system of formal review represents an approach to this goal which evolved over time; it is neither perfect nor unquestionable. The time is now for a new ethos in science in which any reasoned source of scientific criticism is accepted and can expect response.

Thursday, January 20, 2011

The Emperor of All Maladies

I finished yesterday Siddhartha Mukherjee's The Emperor of All Maladies. With the exception of one shocking lapse, it's a very good history of cancer therapy and I strongly recommend it for anyone interested in cancer.

Mukherjee is an oncologist and interleaves his historical view of cancer with the experiences of selected patients he tended. This humanizes the subject and brings useful context. On the historical side, he reaches all the way back to the first written description of cancer, from Pharaonic Egypt (which was also the first expression of futility at treating cancer). A major thread running through the early part of the book is the career of Sidney Farber, who pioneered chemotherapy, and Mary Lasker, who reshaped the role of the U.S. government in cancer research despite having never held elective office.

The book is written for a lay audience and I think he does a good job. There are some key lessons which need to be heard widely. Mukherjee describes many of the twists and turns, the clever leaps of logic and the leaps that failed. For example, Farber's key insight was that leukemia was characterized by improperly functioning bone marrow, a trait shared with several nutritional deficiencies which had just been solved. So he tried treating childhood leukemias (deemed utterly untreatable by oncologists of the time) with B-vitamins, with disastrous results. But, then the second leap occurred -- Farber tried B-vitamin antagonists, and soon found success.

Mukherjee also captures many of the missed opportunities and the non-scientific barriers to progress. Farber was shunned by many colleagues because he was not an oncologist, but a pathologist. The initial misstep led to utter non-cooperation from the oncologists, leaving Farber's doctors to sharpen their own needles and dump their patients' bedpans. His supply of anti-folates came from an immigrant doctor who had left medicine for chemistry and industry after his foreign credentials proved useless. Much more recently, there is the story of how Weinberg's group discovered Her-2 but didn't contemplate pursuing therapy against it. Far worse is how the lawyers at the institute Farber founded nearly iced Gleevec, and even after that issue was dealt with (when Gleevec's proponent Brian Druker moved to another institution) the management of Novartis nearly killed it. Similarly, Genentech all but dropped Herceptin; again it was an outside oncologist who drove the project to success.

The book also gives a good overview of how many things needed to be invented along the way and issues which arise. Simple epidemiology noted the high incidence of scrotal cancer in boy chimney sweeps; the intersection of a rare cancer and rare occupation made the relationship unquestionable. But later, researchers were faced with the challenge of proving a common cancer (lung) was linked to a common environmental factor (smoking) and this required new methods such as the case-control study. In the chemotherapeutic arena, the book covers many iterations of trial design and testing strategy and also brushes a bit on the changing ethical landscape.
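For readers curious about the method itself, the core arithmetic of a case-control study is simple: compare the odds of exposure among cases to the odds among controls. This sketch uses entirely made-up counts, not figures from any real smoking study:

```python
# Illustrative case-control arithmetic with hypothetical counts
# (not data from any actual study). The odds ratio compares exposure
# odds among cases to exposure odds among controls.

def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls):
    """OR = (a/b) / (c/d) for the standard 2x2 exposure table."""
    return ((exposed_cases / unexposed_cases) /
            (exposed_controls / unexposed_controls))

# Hypothetical table: 90 of 100 lung-cancer cases smoked,
# versus 50 of 100 matched controls
print(odds_ratio(90, 10, 50, 50))   # -> 9.0
```

An odds ratio far above 1, from counts any hospital could assemble, is what let researchers implicate a common exposure in a common cancer without waiting decades for a prospective cohort.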

If you've read my previous book reviews, a typical exercise for me is to ask what else could have gone in. Now, this is admittedly sometimes unfair to the author and one prior reviewee has politely chided me that some of what I missed ended up on the editor's floor. Plus, books that want to be read and not doorstops need to respect certain length limits. But, when I am writing I find it a useful exercise; only when we consider the whole range of possibilities can we be confident that the correct balance has been reached.

Perhaps the most surprising area barely touched on is angiogenesis and anti-angiogenic therapy. Some of this may have to do with timing; most of the book concerns events before about 2004, though with impressively quick coverage of some of the learnings from cancer genomes. So many of the travails of Avastin came after that rough divide. Still, this is a hot topic and perhaps that is why I miss it from the book. This book deserves to be widely read and probably will be (it is certainly the de facto Book-of-the-Month at work) and will therefore form the foundation for many public conversations about cancer.

Another topic essentially absent from the book is immunotherapy in its various forms. Again, this is an area having some resurgence (with the approval of the prostate cancer vaccine Provenge), but it isn't clear how important it will be long term. I think there was a passing mention of Coley's toxins, but I saw none of the late 70's excitement around interferons (which, alas, became only a niche player in oncology) or the 80's focus on interleukins.

Finally, though there is some coverage of the world of molecular analysis of cancers (such as cancer genomics), the field of microarray classification of cancer and guiding therapy isn't explored. This is a pretty complex topic with a lot of shifting (and, as noted in the recent George Poste Nature opinion piece, far more smoke than light) so it's a bit understandable it was left out.

Overall though, I can't think of anything that is clearly missing. But now my quibble and serious complaint. Certainly the book not only filled in a lot of areas I just hadn't been exposed to, but even had me running for articles very close to where I see my core training.

The quibble is a bit of a pet peeve. The book, like many (and myself), roughly divides cancer chemotherapeutics into two bins (with room for a small "unclassified" bin as well). Cytotoxics are broad-spectrum cell killers and constitute the bulk of cancer therapy agents.

Targeted therapies are the sticking point. For me, it is important to reserve this category for agents that have two critical properties: we know what the drug acts on in cells and we know something about the relevance of that mechanism to a tumor. Mukherjee gives a number of interesting stories about targeted therapies. For example, anti-estrogen therapy for breast cancer can trace back to a doctor hearing from Scottish shepherds that removing a ewe's ovaries would cause their udders to shrink. Another fascinating story relayed in the book shows how these categories can be tricky. By undescribed means (probably random screening), it had been found that cis-retinoic acid could be used to treat the aggressive leukemia APL, though with highly variable results. An inspired leap of logic led to the decision to test all-trans retinoic acid (ATRA) in APL, with stunningly successful results. Only later was it discovered that most APL cases are driven by a fusion protein derived from a retinoic acid receptor. So ATRA started as "other" and only later fit my definition of targeted therapy.

So what's my beef? It's when Mukherjee describes Velcade, thalidomide and Revlimid in multiple myeloma as targeted therapies. He's hardly alone; this is a common description. But in my taxonomy, Velcade is a cytotoxic. We know where it acts in the cell but not why that is important in cancer and not how to select which patients to use it in. Thalidomide and Revlimid probably should go into the "other" bucket; we're not sure of the mechanism but they aren't generically cytotoxic. Even one of the drugs I currently work on (an HSP90 inhibitor) I would generally call a cytotoxic; it's not a pejorative in my book. On the other hand, in one subset of lung cancer there is a strong hypothesis as to how such HSP90 inhibition works at a molecular level. So as with many biological classifications, they're a bit smudgy -- but I still think they are useful and worth being precise about.

Finally, the big complaint. One of the longer stories in the book concerns the apogee of intensive chemotherapy, in which breast cancer patients were given doses high enough to utterly destroy their bone marrow, followed by bone marrow transplants. This was an important and controversial approach to therapy, with patients begging to get into trials and fighting legal battles to have their insurance companies pay for these unproven treatments. Indeed, Massachusetts was one state which enacted legislation to mandate coverage of this particular treatment. However, when the clinical trial results rolled in, all but one large study showed no benefit. The outlier study showed a huge benefit. However, on closer inspection the outlier turned out to be a fraud of truly monstrous proportions. In discussing that denouement, Mukherjee points out that a male patient "obviously" couldn't have been a legitimate member of the trials. Breast cancer in men is rare, but rare is not impossible, and it is particularly critical for oncologists not to have that blind spot. Indeed, this is particularly important in families with a history of breast cancer; the risk of male breast cancer is much higher in BRCA1/2 families.

That one blemish aside, though, I can strongly recommend this book. As I've noted before, if many read it then there will be a much stronger general basis for discussing cancer and the public policy around it.

Thursday, January 13, 2011

Whither Ion Torrent?

In a comment on yesterday's piece, Matthew Herper asks how the Ion Torrent might be better than Illumina's MiSeq. That's a head-scratcher, and I'm sure the Ion Torrent folks are scrambling to figure out what to tell people.

The problem, looking at it honestly, is that Illumina is claiming they'll beat or meet Ion Torrent on every dimension. Similar startup cost (once you throw in necessary gadgets for the Ion Torrent), similar run cost, similar raw read lengths (but paired!), similar numbers of reads, double the total output, faster runs and less hands-on time or trouble. Did I leave anything out?

So where does Ion Torrent go to keep from being stillborn? Now, they say they have 60 instruments sold, which isn't quite stillborn -- but will MiSeq choke off further growth?

The one potential gap is that Ion Torrent is (almost) here now whereas MiSeq won't show up for about 6 months. That's not a big window, but some folks will be anxious enough that they'll want their sequencer NOW. But, how many sequencers can they really churn out without quality going haywire? The last thing Ion Torrent needs is to put itself behind a reputational eight ball. Of course, pushing hard to place sequencers could pay off huge if Illumina is delayed in rolling out MiSeq -- or can't meet demand.

Upfront cost similarity is for one sequencer. For a variety of reasons, some folks will want more than one (e.g. to make sure at least one is always operating). Presumably the $50K increment starts applying here, though it's not clear how many Ion Torrents can be supported by that $50K-$75K in auxiliary gear. And, it's sobering that you don't beat MiSeq on throughput until you buy the third one. Indeed, for similar amounts you could buy 5 Ion Torrents or 2 MiSeqs and have similar throughput.

Giving away sequencers would be another way to grab an edge -- or cut the price even lower. Obviously, they could start with complimentary placements with a few key genomics bloggers (grin). Seriously, this is not an unprecedented model. Kodak took the world by storm giving away cameras -- processing & reload on the original Brownie was doable only at Kodak. This would be an expensive strategy, but would presumably tie down a bunch of users from going with the competition.

How quickly can Ion Torrent up their specs? That would be the best way to win -- up Illumina's ante in a huge way. This is what their contests are around, but they need the boost now. Again, it is sobering what a challenge has been put before them. Illumina has roughly 2X the read length, though as paired ends rather than one read -- but for many applications paired ends are equivalent or superior (I have one tiny one where you need a monolithic read, but I don't see it as a "killer app"). Longer reads, requiring more cycles, would also exacerbate the speed issue -- putting even more pressure on shortening cycle times.

The sample prep issue is potentially a tough nut. Illumina's slurp of Epicentre means quick-and-easy fragment library preps. For PCR amplicons, any platform will do. But, emulsion PCR seems to be something nobody loves. I haven't seen it done & I think some folks see the trouble as overblown -- but have you ever heard someone say "I love the smell of emulsion breaking in the morning! It smells of victory!"? Illumina's bridge PCR approach certainly looks easy. Can Ion Torrent get access to some non-emulsion clonal amplification (rolling circle?) and get it in place quickly enough?

The ultimate leap past Illumina would be to roll out a new chip beyond the 2nd generation one Ion Torrent just announced. Presumably Illumina has MiSeq's specs near the limits of what the great masses of users will be able to routinely get, but perhaps they left some slack in -- but 5X slack? If Ion Torrent could launch another 10X in number of sensors -- or quickly roll out a way to get more of those sensors working -- that could keep people excited.

Ion Torrent might also try some sort of ante upping in integrating with another player or a novel sample prep approach. Nothing leaps to mind as really fitting the bill, but perhaps something is out there. Hard to beat low input fragment libraries with just a bunch of pipetting, but who knows?

Finally, there is pure cool factor. Ion Torrent had it, iPod dock and all. Apple makes a mint selling cool stuff (with very slick user interfaces and integration) at a premium price. I doubt this will work in the scientific arena, but who knows?

Yep, it's quite a rabbit which Jay Flatley pulled out of his hat this week. Knocking your opponent off-balance before they actually rolled out is something executives must dream of. I think Ion Torrent will survive (and hope it will; Illumina continuing to expand its 70% market share really wouldn't be good for the market), but their trajectory is looking a lot flatter at the moment.

Wednesday, January 12, 2011

My, Oh My, Oh MiSeq!!

I was supposed to be going off to sleep after posting my latest on Ion Torrent but made the mistake of checking my RSS feed -- and there was a new item from Forbes' Matthew Herper (who, I was thrilled to find mentioning this space last week) about a new system from Illumina (announced this evening, hence I get to scoop GenomeWeb!). So much for a quick exit to bed (luckily, I'm working from home tomorrow -- Boston area expecting a vicious winter storm).

The new system is called MiSeq and it is clearly a direct challenge to Ion Torrent as well as the 454 Jr. According to the press release, it will be priced around $125K with run costs in the $400-$750 range. What's really stunning are the specs: 6.8M 2x150 paired end reads in 27 hours -- including cluster generation!! That's really astounding! Plus it needs only 2 square feet of bench space for the entire setup.

Illumina's chemistry is well established, and (as I really appreciate having just done a SOLiD project) they really have the bulk of the academic informatics mind share as well. Paired end reads are really valuable in a number of contexts, or with short targets (e.g. PCR amplicons) they can be treated as something close to a single read of a bit less than 300bp.

The time is also amazing compared to HiSeq, which is spec-ed at 8 days for 2x150 paired end reads. Now, some (all?) of this speedup could simply be through shortening the imaging time; since the flowcells for MiSeq are much smaller, the time to scan them could be similarly reduced. I don't know how much time per base cycle is in scanning vs. chemistry, but it seems like a lot of time could be saved. Alas, if this is the answer (as opposed to clever speeding up of the chemistry), then it doesn't translate to faster run times for HiSeq.
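A back-of-envelope calculation bounds how much time each cycle can be taking. The run times are from the specs above, but the split between chemistry and imaging is not public, and the ~4 hours I deduct for MiSeq cluster generation is purely my guess:

```python
# Rough bound on average wall-clock time per sequencing cycle, from
# the run times quoted in the post. The per-cycle chemistry/imaging
# split is unknown; the MiSeq overhead figure is an assumption.

def minutes_per_cycle(total_hours, read_length, paired=True, overhead_hours=0.0):
    """Average minutes per base-incorporation cycle."""
    cycles = read_length * (2 if paired else 1)
    return (total_hours - overhead_hours) * 60.0 / cycles

miseq = minutes_per_cycle(27, 150, overhead_hours=4)   # 2x150 in 27 h, assume ~4 h clustering
hiseq = minutes_per_cycle(8 * 24, 150)                 # 2x150 spec-ed at 8 days
print(f"MiSeq ~{miseq:.1f} min/cycle vs HiSeq ~{hiseq:.1f} min/cycle")
```

That's roughly an eight-fold gap per cycle, which a much smaller flowcell scan area could plausibly account for on its own.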

The only catch is the system is being announced now; actual machines won't ship until summer. My rough memory is that Illumina has been pretty good about hitting their product launch targets, so perhaps by end of summer I could try one out. It also certainly puts greater impetus on Ion Torrent to drive performance of their system. Illumina is claiming that sample-to-data time will be faster and about 2X the data of the new chip, plus paired end information. Ion Torrent's challenge areas could tip things back the other way -- halve the sample-to-data time, double the read length and double the accuracy.

Illumina also announced they are buying sample prep kit company Epicentre. Epicentre's amazingly clever transposon-based library prep system was one ingredient in a recent whole-genome haplotyping paper (which I was supposed to write on by now; gotta get that done!). They've also demonstrated the ability of this approach to build libraries from amazingly small input amounts. It's a natural buy for Illumina (Epicentre also makes kits for mRNA and miRNA preparation), but it's also a bit bittersweet to see it happen as this means these technologies won't be available (or at least won't be developed as energetically) for other platforms. Of course, if newer platforms can use libraries designed for Illumina, that issue won't apply (as I've noted before).

Tuesday, January 11, 2011

Ion Torrent Throws Another Chip Into the Pot

At the J.P. Morgan conference today, Ion Torrent announced the launch of their second-generation chip ("316"), which offers 10X the data generation at apparently 2X the list price (some of this is based on earlier information, which isn't always consistent). Chips are slated to reach customers' hands this quarter.

Delivering upgrades to the systems via the consumables rather than new instruments is a big part of the promise of Ion Torrent's system, and so actually delivering the chips is a key part of fulfilling that promise. The GenomeWeb article made no mention of any 3rd generation chip, and certainly it will be the regular release of upgraded chips that will really convince the community that this is for real.

The new chip is described as delivering 100Mb per run, or about 1 million reads of 100 bases each (again, ballpark). It's useful to put that in the context of various possible uses. To my eye, the Ion Torrent still won't have the number of reads for applications requiring counting tags -- e.g. RNA-Seq or ChIP-Seq. But for PCR amplicons this would be pretty amazing. Imagine a pool of 100 amplicons; this would mean on average 10K reads per amplicon, which would allow great sensitivity for rare variant detection (either in pooled samples or heterogeneous cancer biopsies). As far as a human genome goes, 100Mb is about 0.025X, so it would not be cost effective (vs. Illumina or SOLiD) or pleasant (imagine snapping all those chips into the instrument!) to go for 40X human coverage. On the other hand, that is enough to give copy number profiling information. It is also only about 2X coverage of a human exome, which isn't nearly enough either. But for a small genome it's pretty decent -- certainly good enough for 40+X coverage of many microbes. A de novo project might need other data, but for resequencing of industrial or clinical variants this could be quite interesting.
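The back-of-envelope arithmetic above is easy to check. A minimal sketch follows; all figures are the ballpark numbers quoted in this post (100Mb per run, ~1M reads, a ~50Mb exome), not official specs, and the 3.1Gb genome and 2.5Mb microbe sizes are my own round-number assumptions.

```python
# Ballpark arithmetic for a 100Mb Ion Torrent "316" run.
# All figures are rough estimates from the post, not official specs.

RUN_YIELD_BP = 100e6      # ~100Mb of sequence per run
READS_PER_RUN = 1e6       # ~1M reads of ~100bp each

# A pool of 100 PCR amplicons sharing the run evenly:
reads_per_amplicon = READS_PER_RUN / 100          # 10,000 reads per amplicon

# Against a ~3.1Gb human genome, one run is a few hundredths of an X:
human_coverage = RUN_YIELD_BP / 3.1e9             # ~0.03X
chips_for_40x = 40 / human_coverage               # over a thousand chips

# Against a ~50Mb exome, one run gives only ~2X coverage:
exome_coverage = RUN_YIELD_BP / 50e6              # ~2X

# But a modest 2.5Mb microbial genome gets deep coverage from one run:
microbe_coverage = RUN_YIELD_BP / 2.5e6           # 40X
```

The same four ratios make the post's point at a glance: great depth on amplicons and microbes, hopeless on a whole human genome, and marginal on an exome.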

1M reads is also in the range of what 454 Senior delivers per run. Of course, those are much longer reads -- if you need the length, you'll care. But given that the upfront cost is so much smaller and the per-run cost a tiny fraction, it would suggest that the 454 platform will quickly be relegated to niche status. Ion Torrent has claimed in interviews that much longer reads have been seen in house, but it isn't clear when these protocols will be rolled out or whether they are really robust.

The GenomeWeb item also suggests a reasonably healthy initial uptake of the platform, with 60 orders booked. However, only "early access" customers have gotten any. It is also rumored that Life Tech is strongly encouraging multiple purchases by customers; someone I know with experience in this noted that this was the pattern with their capillary sequencers. On the one hand, this has practical advantages (beyond getting a few more sales), as the rollout can be staged to a smaller number of sites initially. New technologies always have their hiccups. But on the negative side, it does mean that fewer sites will have an opportunity to put the machine through its paces -- and compete in Ion Torrent's Grand Challenges (must write more on that another time).

Ion Torrents haven't started showing up on the wonderful World Sequencer Map yet, but it should be a matter of time. More seriously, no service provider has yet announced support for this platform. That's a pity, since that would enable much wider access to the capabilities. I can't promise I'll be first in line with an order, but I certainly will be if I have a project then that fits the specs for the machine.

Monday, January 10, 2011

Chromothripsis: Cratering of Chromosomes

A stark frame from Apollo 16 shows a lunar surface remodeled by violent collisions. Even a single static snapshot hints at the order of events. The large crater near the center of the image was later remodeled by a sizable impact that broke the original rim. From careful study of other photographs, especially of the even more chaotic far side of the moon, one can piece together the temporal order of events from such overlaps in craters and their ejecta.

The moon has more than a few parallels to the genomic chaos present in cancer. Most cancers are characterized by a degree of aneuploidy, though a few have only small numbers of genomic alterations, much like the smoother regions pictured. Some cancer cells have chromosome complements like the far side of the moon, battered nearly beyond recognition. And like the two large craters, we often picture the chromosome alterations as sequential, with long gaps in between. Such stepwise changes are seen as mirrored in stepwise changes in biology; each new change has the potential to bring new advantages to a proto-tumor cell. A corollary is that these changes are ongoing: today's tumor has more changes than it did several months ago, and a sample several months in the future will look different yet. Indeed, one criticism sometimes leveled at cancer genomics efforts is that they take a single snapshot of a fast-moving phenomenon.

If you look closer at the lunar photo, another interesting feature emerges. What might first appear to be a slash pointing at the large crater is actually a chain of craters. These are believed to have arisen from a single object fragmenting before impact, much as comet Shoemaker-Levy 9 did a number of years ago before crashing into Jupiter. Such crater chains are seen many times on our own moon, as well as on Mars and the moons of Jupiter. A new paper in Cell brings our attention to the same phenomenon in a subset of cancers, in which genomes are multiply mangled in a single event. And, as with Apollo 16, the key evidence for understanding very dynamic events is a set of static genomic snapshots.

For the impatient such as this author, Figure 1 C&D is perhaps the quickest way to dispel doubt that such a thing occurs. These use Circos diagrams to depict the chromosomal rearrangements present in a chronic lymphocytic leukemia (CLL) sample at initial presentation and relapse: the rearrangement diagrams look like carbon copies, despite 42 distinct alterations in the diagnosis sample. These were identified by high-throughput paired-end sequencing. Many of these breakpoints are locally clustered on the original chromosome; the long arm of chromosome 4 suffered nine breaks, each re-connected to a different chromosome. Clustered breaks are sometimes remarkably tightly packed: seven rearrangements come from a single 30 kilobase region and another six from a 25 kilobase region. However, fragments with such nearby origins are flung around, often ending up distant from each other in the tumor chromosome.

Digging deeper, a curious pattern emerges from comparing loss of heterozygosity with copy number. Clearly the single-copy regions cannot be heterozygous, but what is striking is that all of the copy-number-two regions are heterozygous. This clearly indicates that none of the copy-number-two regions are the result of a reduction to single copy followed by an expansion back to two copies; heterozygosity lost cannot reappear. Change points for copy number fall essentially evenly into the four possible intrachromosomal buckets: deletions, head-to-head inversions, tail-to-tail inversions, and tandem duplications.

Now, this could be a single odd case. The researchers combed through the masses of publicly available cancer cell line copy number and LOH data (which they largely generated) to find more. Out of 746 cancer cell lines, 18 (2.4%) showed the pattern of frequent changes in copy number in local regions of chromosomes. So the phenomenon appears not to be unique to their original sample, though it is present in only a small minority of cell lines. They selected four cell lines for paired-end sequencing to explore this further.

The sequencing reveals the same pattern, though sometimes even more intensely. One line has 239 rearrangements of chromosome 15; another, 77 alterations of the short arm of chromosome 9; a third, a mere 55 rearrangements in chromosome 5. And while these are not the only rearrangements in these genomes, the other changes tend to stand aloof from the chromosomes which have been extensively fractured.

Such a chromosome catastrophe (for which the authors coin the term "chromothripsis"; "thripsis" is apparently Greek for shattering into pieces) could have a number of beneficial effects for a tumor, as the authors explore. One route is the amplification of beneficial oncogenes, perhaps through the formation of small auxiliary chromosomes. In a small cell lung cancer line, such a chromosome carries only 1.1Mb of original cellular content, but contains the potent oncogene MYC -- in multiple copies, even.

Conversely, such a radical remodeling of a chromosome could potentially destroy multiple tumor suppressors in a single event. Possible examples of such simultaneous multiple disruptions were found in one chordoma and several other samples.

The extensive reordering of chromosomes during chromothripsis can also lead to the formation of fusion genes. But, as noted in the paper, it would seem unlikely that these are frequently tumor drivers (though never confuse unlikely with impossible!). In chronically chromosomally unstable cells, there are presumably many different possible rearrangements generated, increasing the odds of finding one with useful advantages. But with chromothripsis, the cell would have to get lucky on one mighty roll of the dice.

What could perform such a genetic wrecking act? One possibility raised in the paper, which was also the first to jump into my mind, would be high-energy ionizing radiation (probably most commonly X-rays from terrestrial sources, but also perhaps cosmic rays). Imagine a condensed chromosome, and then picture an energetic particle hitting it. Some paths through the chromosome would result in one or a few breaks -- but some paths might enfilade the DNA, riddling it with breaks. Alternatively, broken telomeres can lead to the inappropriate fusing of chromosome ends, followed by rebreakage (generating new sticky ends) when the chromosomes are pulled to opposite poles during cell division. While this wouldn't technically be simultaneous destruction, a few successive rounds of such "breakage-fusion-bridge cycles" during cell division can generate quite a mess. It is striking, and consistent with this mechanism, that several examples found in the paper involve rearrangements near the end of one chromosomal arm. It will be fascinating to see if anyone tries to replicate these phenomena in controlled settings and create chromothripsis in the lab, and thereby determine which mechanism is more consistent with the data -- or perhaps to what extent each mechanism contributes, and whether each leaves distinguishing genomic signatures.

This paper has certainly altered my perspective on the chromosomal chaos seen in cancer. My previous gradualist mental model now has an asterisk. Where I previously pictured cells in a constant process of continuing rearrangement and amplification, now I can see that in some cases the genome is actually relatively static. There is a minority school of thought which has argued that aneuploidy is the critical feature of cancer and that any chemotherapeutic regimen is doomed by the dynamics of aneuploidy. Here are clearly some counterexamples: tumors which are not constitutively unstable and which do not progress after treatment through further rearrangement. It is also a useful paper to recall the next time someone dismisses cancer genomics with the argument that cancers are dynamic but a genome analysis looks at only one point in time. Sometimes, if you look carefully enough, a single snapshot can tell a very dynamic story.
P.J. Stephens, C.D. Greenman, B. Fu, F. Yang, G.R. Bignell, L.J. Mudie, E.D. Pleasance, K.W. Lau, D. Beare, L.A. Stebbings, S. McLaren, M-L Lin, D.J. McBride, I. Varela, S. Nik-Zainal, C. Leroy, M. Jia, A. Menzies, A.P. Butler, J.W. Teague, M.A. Quail, J. Burton, H. Swerdlow, N.P. Carter, L.A. Morsberger, C. Iacobuzio-Donahue, G.A. Follows, A.R. Green, A.M. Flanagan, M.R. Stratton, P.A. Futreal, & P.J. Campbell (2011). Massive Genomic Rearrangement Acquired in a Single Catastrophic Event during Cancer Development. Cell, 144 (1), 27-40. DOI: 10.1016/j.cell.2010.11.055

[2011-02-03 Fixed URL for paper; Research Blogging apparently has been failing to build it correctly]

Thursday, January 06, 2011

Oncogenesis Via Altered Enzyme Specificity, Part II

As promised in the EZH2 story, there is another story of cancer-causing mutations tuning an enzyme in an interesting way. It's also a great story of how multiple high-throughput methods can create and exploit an entirely new angle on cancer. I'll try to do a good job on this, but I'm lucky enough to have as regular readers of this space several of the authors referenced here, which should enable any egregious errors on my part to be flagged. I'm also trying to tell the main thread of the story as I see it, and apologize in advance if I get priorities of discovery wrong. I'm relying on final publication dates for organizing the timeline, which is certainly not a perfect strategy.

First, we have to go back just over two years, to the late summer and early fall of 2008. Two different groups reported initial cancer genomics investigations of glioblastoma, a devastating type of brain tumor. Both groups used PCR to amplify targets for Sanger sequencing. One group looked at a focused set of genes in 91 tumor samples; the other looked at many fewer samples (22) but at most known protein-coding exons.

Now this is the sort of decision which was critical: given a particular sequencing budget, do you sequence a lot of targets in a few patients or a select set in more patients? Given we know a lot of oncogenes and tumor suppressors, there is a logic to the focused search. But this was a case where the broad sweep paid off.

What the broad sweep found, and the focused search did not include, were mutations in the gene IDH1. A rapid follow-up study confirmed the recurrent presence of IDH1 mutations in glioblastoma and also found mutations in IDH2, a gene encoding a homologous enzyme.

IDH1 and IDH2 encode isocitrate dehydrogenase, a key enzyme in the citric acid cycle, which is also known as the Krebs cycle or tricarboxylic acid cycle (TCA). This set of metabolic reactions is often shown near the center of a large metabolic diagram as a big circle, which is very appropriate. This set of reactions is central to aerobic energy generation as well as the creation of various useful metabolic intermediates. The normal activity of IDH is

Isocitrate + NADP+ <=> 2-Oxoglutarate + CO2 + NADPH + H+

As with all reactions of the TCA cycle, this reaction is reversible; under some conditions some cells will run it backward, converting 2-oxoglutarate to isocitrate. Also keep in mind that a common synonym for 2-oxoglutarate is alpha-ketoglutarate (aKG).

Now, the influence of primary metabolism on cancer is a hot topic -- again. Back in the 1920s Otto Warburg (who would earn a Nobel prize for his work on cellular respiration) observed that tumors seem to rely on glycolysis far more than their normal cousins, even when oxygen is plentiful. The field was pretty cold for a long while, but lately it has gotten new interest, including a number of startup companies trying to develop cancer therapeutics. During my last job interruption, I consulted for one of these (Agios), though I have no ongoing financial interest in the company.

A striking observation is the pattern of the mutations: in each case a single arginine residue is mutated, though to multiple possible amino acids. However, these are not equiprobable. In the current COSMIC release, there are 1859 reported IDH1 mutations -- and 1468 (78.97%) of those are R132H. Only a handful of mutations have been found outside R132. IDH2 has two hotspot sites, R140 and R172.

Now, what are these mutations doing? What is special about IDH? The first attempt to answer this was published in April 2009 and came to the conclusion that the mutant enzyme is less effective at binding its substrate and generating the product alpha-ketoglutarate. Furthermore, the mutant enzyme was proposed to poison the wild-type copy by forming inactive heterodimers. Finally, this was proposed to activate the important HIF1 transcription factor, which regulates a number of tumor-promoting pathways. So in this view of the world, IDH1/2 are tumor suppressors inactivated in glioblastoma.

The unsatisfying part of this explanation is that it failed to explain why IDH1/2 mutations are so focused. In general, many different mutations can destroy enzymatic function, so tumor suppressor enzymes generally show a diffuse mutation pattern. It is dangerous to believe we can reason through such biochemical puzzles from first principles, but it did mean this solution to the puzzle wasn't a clear winner.

A very different explanation was provided by the group from Agios, published at the end of 2009. Using high-throughput metabolite profiling, their startling discovery was that the IDH1 mutations result in higher levels of 2-hydroxyglutarate (2HG), a compound structurally related to the normal IDH product alpha-ketoglutarate. They confirmed that the mutant enzyme is no longer capable of driving the normal reaction, but that it now catalyzes an analogue of the reverse reaction, using the aKG and NADPH generated by the wild-type enzyme to generate 2HG. Heterodimers appeared to be capable of both reactions, raising the possibility that heterodimers enable very efficient production of 2HG through the coupling of the two enzymatic activities. Structural studies supported this explanation, and finally, increased 2HG levels could be detected in glioblastoma samples mutant for IDH1, but not in those wild-type for IDH1.

Around the same time, another key thread entered the story. Several attempts to identify IDH mutations in other cancers had been made, and while a few had been found there wasn't an obvious cancer with a high frequency of mutations. But, the second acute myelogenous leukemia complete genome sequenced by the Wash U group identified an IDH1 mutation and went on to confirm recurrence of IDH1 mutations in just under 10% of AML samples assayed. Now a second tumor type showed IDH recurrence. Further studies identified IDH2 mutations as well in this disease and confirmed that IDH-mutant leukemias accumulate 2HG.

So now we have an odd mutation pulling an interesting trick -- changing the reaction specificity of a metabolic enzyme -- and showing up repeatedly in two very different cancers. But why is this odd metabolite valuable to the cancer? That is where the latest paper comes in. Published last month, it demonstrated a number of features. First, leukemias mutant for IDH1 or IDH2 show a distinctive DNA methylation profile, one which does not depend on which enzyme is mutated. This methylation profile also shows a greater degree of methylation than most other AML samples. Second, the RNA expression profiles for these tumors are not quite as tightly clustered. Third, expression of mutant IDH enzymes in cell lines raises the amount of 5-methylcytosine in their DNA.

The big clue uncovered is that IDH1/2 mutations are not only mutually exclusive with each other, they are also strictly exclusive with another recurrent mutation in AML: those inactivating the enzyme TET2. More strikingly, TET2's enzymatic role appears to be the first step in demethylating DNA -- and TET2 requires alpha-ketoglutarate! Indeed, co-expression of TET2 and mutant IDH1 (R132H) reduced formation of the TET2 product (5-hydroxymethylcytosine) relative to TET2 plus wild-type IDH1. Furthermore, TET2-mutant leukemias actually show a methylation profile similar to that of IDH1/2-mutant leukemias.

How does this drive leukemogenesis? Looking at the differentially-methylated sites in IDH1/2-mutant AMLs versus other AMLs, the authors found an enrichment for motifs associated with the transcription factors GATA1/2 and EVI1, both known to be important in myeloid differentiation. 40% of the genes in the IDH1/2 signature are known targets of GATA2, and 19% are direct targets of GATA1. Furthermore, GATA1 itself was hypermethylated in their patient cohort, suggesting two levels of suppression of this pathway. Finally, mutant IDH expression or loss of TET2 function was shown to generate more cells with stem-like characteristics, a hallmark of leukemias. In particular, the oncogenic kinase c-KIT showed higher expression; mutational activation of c-KIT characterizes yet another subset of AML.

So in just over two years, we've gone from high-throughput sequencing finding a curious recurrent mutation, to a novel oncogenic modification of metabolism, and now to a mechanistic explanation of how this drives leukemias. I've left out a lot of other literature using these mutations to guide better prognosis and identifying recurrence of IDH mutations in some other tumor types, notably thyroid tumors. Curiously, another set of thyroid tumors appears to be wild-type for IDH1/2 (at least at the hotspots) but has elevated levels of 2HG. Germline IDH2 mutations have also been identified in a subset of patients with abnormal levels of 2HG. Some patients have inactivating mutations in a different gene, succinic semialdehyde dehydrogenase; will this show up as mutant in yet another set of cancers?

So what next? Ideally the clinical value of these findings would go beyond simply staging patients. There are hints that some chemotherapies may perform better or worse in the context of these mutations. Even better would be therapies directed at inhibiting the mutant IDH activity (whilst sparing the wild-type activity). The higher expression of c-KIT in IDH1/2- and TET2-mutant AMLs may suggest the use of c-KIT inhibitors. Certainly one suggestion is to look in other IDH1/2-mutant tumors, and in 2HG-elevated IDH1/2 wild-type tumors, for distinctive hypermethylation. With larger and larger mutational datasets, more mutations may be found which are clearly mutually exclusive with IDH mutations (exclusion with NPM1 has also been observed in leukemia); such findings could lead to identifying further genes affecting genome methylation.
Figueroa ME, Abdel-Wahab O, Lu C, Ward PS, Patel J, Shih A, Li Y, Bhagwat N, Vasanthakumar A, Fernandez HF, Tallman MS, Sun Z, Wolniak K, Peeters JK, Liu W, Choe SE, Fantin VR, Paietta E, Löwenberg B, Licht JD, Godley LA, Delwel R, Valk PJ, Thompson CB, Levine RL, & Melnick A (2010). Leukemic IDH1 and IDH2 mutations result in a hypermethylation phenotype, disrupt TET2 function, and impair hematopoietic differentiation. Cancer cell, 18 (6), 553-67 PMID: 21130701

Tuesday, January 04, 2011

Oncogenesis Via Altered Enzyme Specificity, Part I

(Correction: a friend close to the story pointed out EZH2 is a lysine, not arginine methyltransferase. Stupid mistake! -- though I got it right once in the original version -- small consolation)

There's a bit of an involved story I've been meaning to put together & now another paper with a similar theme showed up. After some thought, I realized that the second story should go first.

Oncogenes are genes which, when added to a cell, can transform it to a cancerous state. A number of different classes of proteins can be oncogenic, but quite a few are either transcription factors or enzymes. I'm going to focus here on enzymes.

Oncogenic enzymes somehow have an enzymatic activity which promotes cell growth. A lot of oncogenic enzymes are protein kinases, and these can be activated by a number of mechanisms. For example, some are activated simply by being overexpressed, which in cancer occurs most commonly by amplification of the underlying chromosomal DNA. Another recurrent mechanism is the removal of inhibitory domains. Other changes alter the equilibrium between active and inactive states. Certain kinases are activated by dimerization, so some oncogenic mutations enhance dimerization. For example, in some fusion kinases, in which a chromosomal rearrangement has fused a kinase with another protein, a key role of the partner protein is to supply a dimerization motif.

The RAS family of GTPases is an interesting variant on this theme. RAS proteins (KRAS, HRAS and NRAS being the most important oncogenes) transmit growth-promoting signals when they have GTP bound. They also have a slow GTPase activity which hydrolyzes the GTP to GDP, and with GDP bound they no longer transmit the signal. Exchange of the GDP for GTP reactivates the growth signal. Oncogenic KRAS mutations slow or eliminate the GTPase activity; without it, the growth signal never turns off. Hence, only a small number of possible mutations in KRAS will successfully turn it into an oncogene, since a mutation must inactivate the GTPase without altering the other functions of the protein.

The two stories -- the brand-new one related here, and one which hit a fascinating milestone recently and will be covered in a future installment -- are cases of additional ways enzymes can be altered to promote tumors. In each case, rather than simply activating or inactivating an enzyme, the mutations tune the activity of the enzyme in a way favorable to cancer.

About a year ago the Vancouver cancer genomics group published the identification of recurrent mutations in lymphomas in the gene EZH2, a histone methyltransferase. Strikingly, the mutations are strongly concentrated on a single residue, Tyr641, though it is changed to a number of different amino acids. So what is so important about Tyr641? A new paper provides the mechanistic explanation.

Histone methyltransferases such as EZH2 add a methyl group to lysine residues (other methyltransferases can methylate arginine, which is what I mistakenly pegged EZH2 as in the original version of this post). Any given lysine can have four different methylation states: none, mono, di, or tri. This means in turn that a lysine methyltransferase has three types of substrates: those with 0, 1 or 2 existing methyl groups. What the new work shows is that Tyr641 is important in selecting among these substrates, and the mutations focus the enzyme's activity on converting dimethyl lysine to trimethyl lysine.

Several lines of evidence point to this conclusion. Two other lysine methyltransferases have been shown to prefer trimethylation when mutated in an analogous way. Molecular modeling suggests that this tyrosine serves to inhibit effective operation on dimethyl substrates. In vivo the mutation acts dominantly to increase trimethylated lysine levels on histones and in vitro the appropriate complex has an increased preference for dimethylated peptides.

This is the first reported disease-causing mutation of this sort, though as noted above, similar mutations have been created by scanning mutagenesis. Will we see others? There are many other homologous methyltransferases, but a quick sampling of COSMIC doesn't reveal a homolog with any recurrent pattern of mutation. It's worth keeping a lookout, but if none emerges then a new mystery will remain to be explored: why is EZH2 special in this regard?

Going a bit farther afield, could there be oncogenic mutations in kinases which alter the substrate specificity? Given that some kinases require prior ("priming") phosphorylation of substrates, could a mutation in the kinase reduce this requirement? Alternatively, do some kinases phosphorylate both cancer-promoting and cancer-retarding substrates? If so, could mutations exist which shift the balance towards cancer promotion? Seems like a long shot, but who would have guessed in advance of mutations like the EZH2 ones?
Yap DB, Chu J, Berg T, Schapira M, Cheng SW, Moradian A, Morin RD, Mungall AJ, Meissner B, Boyle M, Marquez VE, Marra MA, Gascoyne RD, Humphries RK, Arrowsmith CH, Morin GB, & Aparicio SA (2010). Somatic mutations at EZH2 Y641 act dominantly through a mechanism of selectively altered PRC2 catalytic activity, to increase H3K27 trimethylation. Blood PMID: 21190999

Monday, January 03, 2011

Semianalogy to the Semiconductor Industry?

One area where Jonathan Rothberg has gotten a lot of mileage in the tech press is his claim that Ion Torrent can successfully leverage the entire semiconductor industry to drive the platform into the stratosphere. Since the semiconductor industry keeps building denser and denser chips, Ion Torrent will be able to get denser and denser sensors, leading to cheaper and cheaper sequencing. It's an appealing concept, but does it have warts?

The most obvious difference is that your run-of-the-mill semiconductor operates in a very different environment, a quite dry one. Ion Torrent's chips must operate in an aqueous environment, which presumably means more than a few changes from a standard design. Can any chip foundry in the world actually make the chips? That's the claim Ion Torrent likes to make, but given the additional processing steps that must be required, some skepticism isn't out of the question.

But perhaps more importantly, it is in the area of miniaturization where the greatest deviation might be expected. Most of Moore's law in chips has come from continually packing greater numbers of smaller transistors onto a chip. Simply printing the designs was one challenge to overcome; finer designs require photolithography with wavelengths shorter than visible light. This is clearly in the category of problems which Ion Torrent can count as solved by the semiconductor industry.

A second problem is that smaller features are less and less tolerant of smaller and smaller defects in the crystalline wafer from which chips are fabricated. Indeed, memory chips are designed with more memory units than the final product requires, so that some can be sacrificed to defects; any excess units left over after manufacturing are shorted out in a final step. Chips with excessive defects go in the discard bin, or sometimes allegedly are sold simply as lower-grade memory units. One wonders if Ion Torrent will give all their partial duds to their methods development group, or perhaps give them away in a way calculated for maximal PR impact (to high schools?).

But there are other problems which are quite different. For example, with chips, a challenge at small feature sizes is that the insulating regions between wires become so narrow as to be less reliable. Heat is another issue with small feature sizes and high clock speeds. These would seem to be problems the semiconductor industry won't pass on to Ion Torrent.

On the other hand, Ion Torrent is trying to do something very different than most chips. They are measuring a chemical event, the release of protons. As the size of the sensor features decrease, presumably there will be greater noise; at an extreme there would be "shot noise" from simply trying to count very small numbers of protons.

Eventually, even the semiconductor industry will hit a limit on packing in features. After all, no feature in a circuit can be smaller than an atom in size (indeed, a question I love to ask but which usually catches folks off-guard is how many atoms are, on average, in a feature on their chip). One possible route out for semiconductors is to go vertical; stacking components upon components in a way that avoids the huge speed and energy hits when information must be transferred from one chip to another. It is very difficult to see how Ion Torrent will be able to "go vertical".
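For what it's worth, that atoms-per-feature question has a quick back-of-envelope answer. A sketch follows; the 32nm process node and silicon's atomic density are my own illustrative assumptions, not figures from this post.

```python
# Roughly how many silicon atoms fit in one feature of a circa-2011 chip?
# The 32nm node and silicon's atomic density are assumptions for illustration.

SI_ATOMS_PER_CM3 = 5.0e22      # atomic density of crystalline silicon
FEATURE_NM = 32                # a then-leading-edge process node

feature_cm = FEATURE_NM * 1e-7
# Treat the feature as a cube of crystalline silicon:
atoms_per_feature = SI_ATOMS_PER_CM3 * feature_cm**3   # ~1.6 million atoms
atoms_across = atoms_per_feature ** (1 / 3)            # ~120 atoms per edge
```

Millions of atoms per feature in 3D, but only on the order of a hundred across any edge, which is why the atomic limit looms so close.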

None of this erases the fact that Ion Torrent will be able to leverage a lot of technology from chip manufacturing. But it will not solve all their challenges. The real proof, of course, will be Ion Torrent regularly releasing new chips with greater densities. An important first milestone is the on-time release of the second-generation chip this spring, which is touted as generating four times as many reads (at double the cost). Rothberg claims he will have a chip capable of a single human exome by 2012; assuming 40X coverage of a 50Mb exome, that would require a 200-fold improvement in performance, or nearly four quadruplings. Some of that might come from process or software improvements to increase the yield per chip of a given size (more on the contest in the future); indeed, meeting that schedule in two years would demand either that or a very rapid stream of quadrupling gains.
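That scaling arithmetic is simple enough to sketch. In the snippet below, the ~10Mb first-generation chip yield and the quadrupling-per-generation cadence are my assumptions drawn from the figures in this post.

```python
import math

# How far Ion Torrent must go to put one human exome on a chip by 2012.
# The 10Mb first-generation yield and 4X-per-generation cadence are
# assumptions based on the rough figures discussed in the post.

EXOME_BP = 50e6                          # ~50Mb exome target
TARGET_COVERAGE = 40                     # 40X coverage
needed_bp = EXOME_BP * TARGET_COVERAGE   # 2Gb of sequence per chip

first_gen_bp = 10e6                      # ~10Mb from the first chip
fold_needed = needed_bp / first_gen_bp   # 200-fold improvement

# If each chip generation quadruples output, count the generations needed:
quadruplings = math.log(fold_needed, 4)  # ~3.8 -- "nearly four quadruplings"
```

At one new chip generation per year, that cadence leaves essentially no slack before 2012, which is exactly why on-time chip releases matter so much.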

As I have commented before, they might even pick a strategy where some of the chips trade off read length or accuracy for higher density. For example, applications requiring counting (expression profiling, rRNA profiling) of tags which can be distinguished relatively easily (or where the cost of some confusion is small) might prefer very high numbers of short reads.

Sunday, January 02, 2011

First of a Torrent?

For the New Year I've resolved to be a bit more regular in posting here, and as with all New Year's resolutions it is easy to start out big, so there may be a flurry of posting this week. Of course, the real challenge will be to maintain that energy across an entire year. But, to kick-start things I spent the holiday weekend drafting nearly a week's worth of output.

Ion Torrent continues to attract a lot of attention, though its launch last year hasn't yet resulted in my getting hands or eyes on one. Ideally, an evaluation machine would show up, but that's happening only in my dreams. Nor did my attempt to win a free one succeed, though one winning entry was of very similar concept (and both were from Massachusetts!). Most of the press has continued to edge towards breathless and unthinking hype, but the counterpoint is in Nick Loman's well-thought-out bit of exasperation with that hype.

My own thoughts continue to lie in between. I continue to be frustrated by the absurd hype in various tech press outlets, but I also see this as a useful machine. There are a number of interesting angles, which I've decided to tackle with a small series rather than one big lump. Ideally this splitting will result in more coherent arguments on my part, but that's for you to decide.

To me the most frustrating angle is the view that sequencing is a monolith and a single race, with one winner. For Sanger sequencing, this tended to be the case because the underlying technology was so similar and the various platform makers didn't differentiate much. ABI took the lion's share of the market, Amersham was a distant second, and that was almost it. LiCor had the one somewhat differentiated entry, with a different dye system yielding longer reads, but at the cost of greatly reduced throughput. Even those reads were not so much longer (I think they claimed just over a kilobase, whereas ABI routinely got about 3/4 kilobase) as to really drive a big niche.

But second-gen has evolved in a very different way. Speed, upfront cost, running cost, library prep, pre-sequencer prep, accuracy and read length are all variables in which the different platforms have landed in different boxes. Some of this is inherent in the technologies, whereas the rest is due simply to design choices or intellectual property positions. An example of the latter is Illumina's patent lock on bridge PCR, whereas the other amplification-requiring platforms appear to nearly all use emulsion PCR (Complete Genomics uses rolling circle).

So, to me, evaluating a platform and where it is going depends on looking at that particular combination of variables and asking what applications work best. Once that's worked out, the size of the market can be speculated on, as well as who else might be bumping elbows in that space.

Now, Ion Torrent has a number of operational features worth noting. First, it has the lowest upfront cost of any sequencer, at around $100K fully loaded (sequencer, server & emPCR robots). This is a first point of my annoyance with many glowing articles: they parrot the "$50K" price, which buys you just the sequencer. Even worse are the ridiculous claims of Ion Torrent being 1/10th the price of the competition; this compares only to the highest-priced alternatives and not to the likely alternative choice.

Second, the run times are quite fast. But again, many of those enamored with the device mindlessly cite the time to acquire data, ignoring all the up-front prep. Some of that prep will depend on the particular application, but it is still on the order of 2-3 days to go from DNA sample to data off the sequencer. Now, I know the pain of anticipating data, having recently gnawed my nails off waiting for a high-stakes paired-end SOLiD 4 run (closer to two weeks than one), but the truth is a number of other platforms offer similar speed (more in another installment).

Third, the initial release is claiming about 100,000 reads of 100 bp or more (up to about 200). The chip costs $250 and there is another $250 to prep the sample; it is unclear from anything I've seen what is included in that prep cost and in particular how many runs you can get from one such prep. For example, if that includes library adapters and I'm using a direct PCR approach, then that $250 cost is actually inflated. More importantly, if I need more than 100K reads for an application, does that $250 of prep buy me more than one run (i.e. will 200K reads from one sample cost me $750 or $1000?). Error rate is not clear and homopolymers will be a problem, though the probability of miscalling these isn't well documented.

Given these fuzzy estimates, what sort of applications will be best for the Ion Torrent platform in its initial state? To me, and clearly to others, the sweet spot is sequencing of targeted and easily interpretable regions. The two U.S. contest winners (was the European giveaway ever executed?) are just along those lines.

A group at MGH is planning to perform PCR-based targeted sequencing of cancer. This is a very appropriate application which fits many of the properties of Ion Torrent. Many cancer mutations are what we call "hotspot mutations"; the same mutations are seen repeatedly. For example, in the very important KRAS oncogene the vast majority of mutations occur in any of the six nucleotides of two adjacent codons. Design your PCR assay correctly, and all you would need is a six base pair read length (indeed, several tests approved for the clinic or on their way there could be seen as 1-bp read length sequencing assays). More realistically, you need to set the primers back a bit from the hotspot and read through the primers, but for this the 100 bp reads of Ion Torrent will be quite good. Now, this hotspot behavior governs most, but not all, activating mutations in oncogenes. Intuitively, it is hard to turn something on by randomly tinkering with it, though in a few cases the tinkering is by removing a whole inhibitory exon, and there are many ways to do that. On the other hand, many tumor suppressors are mutated in a diffuse pattern. Sometimes there are hotspots due to particular mutational processes or other forces, but these are never as hot.
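The six-base hotspot logic is almost trivially codable, which is the point. The wild-type KRAS codon 12/13 sequence is GGT GGC; everything else below (the offset into the read, the padded read itself) is a hypothetical placeholder, not a real primer design:

```python
# Sketch: screening reads for KRAS codon 12/13 hotspot mutations.
# GGTGGC is the wild-type codon 12-13 sequence; the offset and the example
# read are invented placeholders, not real amplicon coordinates.

WILD_TYPE_HOTSPOT = "GGTGGC"   # KRAS codons 12 and 13
HOTSPOT_OFFSET = 42            # hypothetical position of codon 12 within the read

def call_hotspot(read):
    """Return (is_mutant, observed) for the six hotspot bases, or None
    if the read is too short to cover them."""
    end = HOTSPOT_OFFSET + len(WILD_TYPE_HOTSPOT)
    if len(read) < end:
        return None
    observed = read[HOTSPOT_OFFSET:end]
    return (observed != WILD_TYPE_HOTSPOT, observed)

# A read carrying the common G12D change (GGT -> GAT at codon 12):
mutant_read = "A" * HOTSPOT_OFFSET + "GATGGC" + "A" * 30
print(call_hotspot(mutant_read))   # (True, 'GATGGC')
```

A real assay would of course allow for sequencing error and indels, but the calibration burden is tiny when only six positions matter.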

The other winning entry was from the Marine Biological Laboratory at Woods Hole, to rapidly identify bacterial contamination of water. Again, PCR-based and looking at well-defined signatures, in this case ribosomal RNA profiles.

Each example fits well into the Ion Torrent's capabilities. In both cases, you don't need enormous numbers of reads to do a decent job, though more reads would let you either look for rarer species or assay more loci. Since you are looking for signatures, the assays can be calibrated well in advance versus the error and read length characteristics of the platform. For example, you can know in advance where there are homopolymer runs and adapt for them. Offhand, I can't think of an oncogenic hotspot that involves a homopolymer run and in oncogenes the frame must generally be preserved (again, there are those rare non-coding oncogenic changes) so that would help constrain errors.
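Knowing the homopolymer trouble spots in advance is a one-liner's worth of work once the amplicon sequences are fixed. A minimal sketch (the amplicon string and the length-4 threshold are arbitrary choices for illustration):

```python
import re

def homopolymer_runs(seq, min_len=4):
    """Find runs of a single repeated base of at least min_len in an amplicon,
    i.e. the positions most at risk of miscalls on flow-based chemistries."""
    pattern = r"(.)\1{%d,}" % (min_len - 1)
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq)]

# Hypothetical amplicon fragment -- not a real assay sequence:
amplicon = "ACGTTTTTGACCCCAGGGGGT"
print(homopolymer_runs(amplicon))  # [(3, 'TTTTT'), (10, 'CCCC'), (15, 'GGGGG')]
```

Run this over every designed amplicon up front, and you know exactly which signature positions to distrust (or redesign around) before the first chip is loaded.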

Given that sweet spot, who is going to feel Ion Torrent's elbows? The obvious candidate is Roche. The 454 GS Jr is around 2-3X the upfront cost (again, for a complete infrastructure), around 4X the cost per run, and will yield 0.5-1X the number of reads -- but much longer ones. However, for both the applications above long reads aren't really such a great advantage. Again, for many of your signatures you can design the signature around the read length, and really long reads add only a bit more value. For dealing with clinical cancer samples, you really want to keep your PCR amplicons down to 250bp or less, because the DNA you get is generally quite fragmented and has other impediments to PCR. With a protocol that can read in from each end of the PCR fragments (perhaps randomly, or perhaps in two separate runs, one from each end), Ion Torrent's current length fits well. Short signatures will work better on both platforms in any case, as in the real world you get some reads that peter out much sooner -- short signatures mean more effective reads of a signature per run. 454 has a more established chemistry and performance specs, but I would expect Ion Torrent to be serious competition for the Jr platform, with the 454 family holding on to the applications (such as HLA haplotyping) where length really does matter. Holding on, that is, until Ion Torrent can push their read lengths to similar territory.

That's a pretty big lump. The next installment (not necessarily tomorrow; I might interleave some other topics burning on my desk) will look at the much-stressed tie to the semiconductor industry.