Tuesday, August 31, 2010

Worse Could Be Better

My eldest brother was in town recently on business & in our many discussions reminded me of the thought-provoking essay "The Rise of 'Worse is Better'". It runs on a thought train similar to Clayton Christensen's books -- sometimes really elegant technologies are undermined by ones which are initially far less elegant. In the "WiB" case, the more elegant system is too good for its own good, and never gets off the ground. In Christensen's "disruptive technology" scenarios, the initially inferior technology serves utterly new markets that are priced out of the more elegant approaches, but then nibbles slowly but surely until it replaces the dominant one. A key conceptual requirement is to evaluate the new technology on the dimensions of the new markets, not the existing ones.

I'd argue that anyone trying to develop new sequencing technologies would be well advised to ponder these notions, even if they ultimately reject them. The newer and more different the technology, the longer they should ponder. For it is my argument that there are indeed markets to be served other than $1K high quality canid genomes, and some of those offer opportunities. Even existing players should think about this, as there may be interesting trade-offs that could open up totally new markets.

For example, I have an RNA-Seq experiment off at a vendor. In the quoting process, it became pretty clear that about 50% of my costs are going to the sequencing run and the other 50% to library preparation (of course, within both of those are buried various other costs, such as facilities & equipment as well as profit, but those aren't broken out). As I've mentioned before, the cost of sequencing is plummeting, but library construction is not on such a steep trend.
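Since the two halves start roughly equal but only one is on a steep decline, it's worth seeing how fast library prep comes to dominate. A toy Python sketch, with purely illustrative numbers of my own (assuming sequencing cost halves each year while library construction stays flat):

```python
# Illustrative only: a project that starts at a 50/50 split between
# library prep and sequencing, where sequencing cost halves annually.
def library_share(years, library_cost=500.0, seq_cost=500.0):
    """Fraction of total project cost spent on library prep after `years`."""
    seq = seq_cost / (2.0 ** years)  # sequencing cost after repeated halvings
    return library_cost / (library_cost + seq)

for y in range(5):
    print(y, round(library_share(y), 2))
```

Under those assumptions the library share goes from half the budget to nearly all of it within a few years, which is why attacking library construction looks like such an attractive target.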

So, what if you had a technology that could do away with library construction? Helicos simplified it greatly, but for cDNA still required reverse transcription with some sort of oligo library (oligo-dT, random primers or a carefully picked cocktail to discourage rRNA from getting in). What if you could either get rid of that step, read the sequence during reverse transcription or not even reverse transcribe at all? A fertile imagination could suggest a PacBio-like system with reverse transcriptase immobilized instead of DNA polymerase. Some of the nanopore systems theoretically could read the original RNA directly.

Now, if the cost came down a lot, I'd be willing to give up a lot of accuracy. Maybe you couldn't read out mutations or allele-specific transcription, but suppose expression profiles could be had for tens of dollars a sample rather than hundreds? That might be a big market.

Another play might be to trade the read length or quality of an existing platform for more reads. For example, Ion Torrent is projected to initially offer ~1M reads of modal length 150 for $500 a pop. For expression profiling, that's not ideal -- you really want many more reads but don't need them so long. Suppose Ion Torrent's next quadrupling of features came at a cost of shorter reads and lower accuracy. For the sequencing market that would be disastrous -- but for expression profiling it might be getting into the ballpark. Perhaps a chip with 16X the features of the initial one -- but only 35bp reads -- could help drive adoption of the platform by supplanting microarrays for many profiling experiments.
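To put numbers on that trade (the launch specs are the projections above; the 16X/35bp chip is purely my hypothetical):

```python
# Projected launch chip vs a hypothetical profiling-oriented chip.
initial_reads, initial_len = 1_000_000, 150      # projected launch spec
profiling_reads = 16 * initial_reads             # hypothetical 16X-feature chip
profiling_len = 35                               # hypothetical short reads

print(profiling_reads)                  # read count is what matters for tag counting
print(initial_reads * initial_len)      # bases per run, launch spec
print(profiling_reads * profiling_len)  # bases per run, profiling chip
```

Note that for counting applications the 16M reads are the headline number -- and even raw throughput in bases goes up, despite the much shorter reads.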

One last wild idea. The PacBio system has been demonstrated in a fascinating mode they call "strobe sequencing". The gist is that the read length on PacBio is largely limited by photodamage to the polymerase, so letting the polymerase run for a while in the dark enables spacing reads apart by distances known within some statistical limits. There's been noise about this going to at least 20Kb and perhaps much longer. How long? Again, if you're trapped in "how many bases can I generate for cost X", then giving up a lot of features for such long strobe runs might not make sense. But suppose you really could get 1/100th the number of reads (300) -- but strobed out over 100Kb (with a 150bp island every 10Kb). I.e., get 5X the fragment size by giving up about 99% of the sequence data. 100 such runs would be around $10K -- but would give a 30,000 fragment physical map with markers spaced about every 10Kb (and in runs of 100Kb). For a mammalian genome, even allowing for some loss due to unmappable islands, that would be at least a 500X coverage physical map -- not shabby at all!
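For the record, here is the arithmetic of that scenario in Python -- every input is my speculative number from above, not a PacBio spec:

```python
# Speculative strobe-mapping scenario, numbers as assumed in the text.
reads_per_run = 300        # ~1/100th of a normal run's read count
fragment_span = 100_000    # each read strobed out over 100 kb
island_len = 150           # bases sequenced per island
island_gap = 10_000        # spacing between islands
runs, cost_per_run = 100, 100

fragments = reads_per_run * runs                # mapped fragments in the physical map
islands_per_fragment = fragment_span // island_gap
markers = fragments * islands_per_fragment      # mappable 150bp islands overall
total_cost = runs * cost_per_run                # dollars for the whole map
```

So for roughly $10K you'd have 30,000 long-span fragments carrying on the order of 300,000 mappable islands -- a lot of linking information for the money.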

Now, I won't claim anyone is going to make a mint off this -- but with serious proposals to sequence 10K vertebrate genomes, such high-throughput physical mapping could be really useful and not a tiny business.

Sunday, August 29, 2010

Who has the lead in the $1K genome race?

A former colleague and friend has asked over on a LinkedIn group for speculation on which sequencing platform will deliver a $1K 30X human genome (reagent cost only). It is somewhat unfortunate that this is the benchmark, given the very real cost of sample prep (not to mention other real costs such as data processing), but it has tended to be the metric of most focus.

Of existing platforms, there are two which are potentially close to this arbimagical goal (that is, a goal which is arbitrary yet has obtained a luster of magic through repetition).
ABI's SOLiD 4 platform can supposedly generate a genome for $6K, though even with pricing from academic core labs I can't actually buy that for less than about $12K (commercial providers will run quite a bit more; they have the twin nasty issues of "equipment amortization" and "solvency" to deal with).
The SOLiD 4 hq upgrade is promised for this fall with a $3K/genome target. Could Life Tech squeeze that out? I'm guessing the answer is yes, as the hq does not use an optimal bead packing. Furthermore, the new paired end reagents will offer 75 bp reads in one direction but only 25 in the other.
I've never understood why a ligation chemistry should have an asymmetry to it (though perhaps it is in the cleavage step), so perhaps there is significant room for improvement there. Of course, those possible 40 cycles are not free, so whether this would help with cost/genome is not obvious (though it would be advantageous for many other reasons). Though, since they can currently get a 30X genome on one slide, longer reads would enable packing more genomes per slide & perhaps that's where the accounting ends up favoring longer reads.

Complete Genomics is the other possible player, but we have an even murkier lens on the reagent costs per genome, given that Complete deals only in complete genomes and only in bulk. But, they do have to actually ensure they are not losing money (or at least, with their IPO they won't be able to hide the bleed). Indeed, Kevin Davies (who has a book on $1K genomes coming out) replied on the thread that Complete Genomics has already declared to be at $1K/genome in reagent costs. Perhaps we should move the target to something else (Miss Amanda suggests that $1K canid genomes are far more interesting).

What about Illumina? With HiSeq, they are supposedly at $10K/genome, and many have noted that the initial HiSeq specs were for a lower cluster packing than many genome centers achieve. That also brings up an interesting issue of consistency -- how variable are cluster packings, and therefore the output per run? In other words, what sigma are we willing to accept in our $1K/genome estimate? Also, the HiSeq specs were for shorter reads than the 2 x 150 paired end reads that are quite common in 1000 Genomes depositions in the SRA (how much longer can Illumina go?).

So, perhaps any of these three existing platforms might meet the mark (454 is a non-starter; piling up data cheaply is not its sweet spot). What about the ones in the wings? Of course, these are even murkier, and we must rely even more on their makers' projections (and, potentially, wishful thinking).

Ion Torrent's technology (to be re-branded by Life Tech?) isn't nearly there right now. For $500 (the claim is) you'd get 150Mb of data, or about 0.1X coverage for $1000, so we need about a 300X improvement. However, there should be a lot of opportunity to improve. The dimension touted most in the past is feature density; Ion Torrent was apparently already working on a chip with about 4X the number of features. If we round 300 down to 256, then that would be only 4 rounds of quadrupling. If Life could pump those out every 6 months, that would be only two years to a $1K genome. Who knows how realistic that schedule would be?
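That back-of-envelope can be written down explicitly (the throughput figures are the claimed specs above; the release cadence is pure guesswork):

```python
import math

# Gap to a 30X human genome (3 Gb) for $1000, from the claimed 150 Mb / $500.
target_mb = 30 * 3_000          # 90 Gb of raw sequence wanted per $1K
claimed_mb = 2 * 150            # 300 Mb per $1K at launch specs
gap = target_mb / claimed_mb    # ~300-fold shortfall

# Rounding 300 toward 256, as in the text, gives a whole number of quadruplings.
quadruplings = round(math.log(gap, 4))
years = quadruplings * 0.5      # at a hypothetical 6-month release cadence
print(gap, quadruplings, years)
```

Four quadruplings at six months each is where the "two years" figure comes from -- and why the schedule, not the math, is the real question.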

But Ion Torrent could push on other dimensions as well. Because the chip itself is a huge chunk of the cost of a run, squeezing out longer read lengths should be possible. Since 454 gets nearly 500 basepair reads routinely (and up to a kilobase when things are really humming), perhaps there is a factor of nearly 4 to be had from longer reads. In a similar manner, a paired-end protocol could potentially double the amount of sequence per chip (at a cost of perhaps a bit more than double the runtime; not such a big deal if the run is really an hour). Could that be done? I think I have the schematic for an approach (which might also work on 454); trade proposals for sequencing instruments will be put to my employer for consideration! Finally, as noted in a thread on SEQAnswers, Ion Torrent is apparently achieving only about 1/8th efficiency in converting chip features to sequence-generating sites; better loading schemes might squeeze out another few fold. So perhaps Ion Torrent really is 1-2 years away from $1K genomes (much more likely the 2).
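Multiplying out those other dimensions shows how they stack -- every individual factor here is my guess from the discussion above, not anything Ion Torrent has promised:

```python
# Hypothetical improvement factors, multiplied together.
read_length_gain = 500 / 150   # if 454-like read lengths are reachable, ~3.3X
paired_end_gain = 2            # a paired-end protocol doubling sequence per chip
loading_gain = 8               # from ~1/8th feature efficiency toward full loading

other_gains = read_length_gain * paired_end_gain * loading_gain
print(round(other_gains))            # ~53X without touching feature density
print(round(other_gains * 16))       # plus two chip quadruplings: well past 300X
```

The point is simply that feature density alone need not carry the whole ~300-fold burden; the other dimensions, if they pan out, cover a big chunk of it.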

Moving on, could Pacific Biosciences (or the Life Tech StarLight (née VisiGen) technology) have a shot? Lumping them together (since we have virtually no price/performance information for StarLight), PacBio is initially promising $100 runs generating ~60Mb, so $1K would get you about 0.2X coverage -- about 150-fold off, which we'll round to 128-fold, or 7 doublings. They are already said to be testing a chip with twice the density, plus a better loading scheme worth around another 2X -- so perhaps it's only 5 doublings.
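The same sort of arithmetic for PacBio, again from the projected rather than measured specs:

```python
import math

# $100 runs at ~60 Mb means $1K buys ~600 Mb, vs 90 Gb for a 30X 3 Gb genome.
mb_per_1k = 10 * 60
gap = (30 * 3_000) / mb_per_1k       # ~150-fold shortfall
doublings = round(math.log(gap, 2))  # rounding 150 toward 128 gives 7 doublings
already_claimed = 2                  # 2X density chip plus ~2X better loading
print(doublings - already_claimed)   # doublings still to find
```

Seven doublings minus the two reportedly in hand leaves the five cited above.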

Finally, there are the technologies which haven't yet demonstrated the ability to read any DNA, but could do so and then move quickly (or not). In this category are any nanopore-based systems (which is a dizzying array of approaches) and Gnu Bio's sequencing-by-synthesis-in-nanodrops approach. And perhaps a few more. These don't even work yet, so even speculative price performance information isn't available.

Finally, a quick note about what a $1K genome means. The X-Prize folks have set very strong standards -- standards which are far beyond what any short read technology could hope to accomplish, and also far beyond what many sequencing applications need. The organizers did not set the bar so high without reason: there are applications which need that rigor, and it will greatly cut down on false positives. But, as the regular stream of papers shows, much lower standards will suffice to get interesting biology out of whole human genomes.

Tuesday, August 24, 2010

Lawyers v. Research Funding?

An ongoing personal quest is to attempt to fill in the gaps in my original education, particularly outside the sciences, where I feel there exist gaping chasms. Through Wikipedia, books and especially recorded college courses, I slowly patch the deficiencies of my education (or, all too commonly, of my youthful attention during that education). I'm currently making a third pass through a wonderful course on Roman history, since I enjoyed it very much the first two times.

During Rome's early expansion it was ruled by rotating sets of elected officials under a system known to us as the Roman Republic. A series of events (known to scholars as the Roman Revolution) over many decades disrupted this system, culminating in the replacement of the Republic with the military dictatorship of the Emperors, which would remain until the fall of the empire. An initiating figure in the Revolution was an official named Tiberius Gracchus, who in the service of high-minded ideals (rewarding landless soldiers with their own plots on which to support themselves) changed the nature of Roman politics by introducing mob violence into the process (as well as a certain degree of ruthlessness in dealing with the opposition of colleagues).

I fear that yesterday's court decision regarding embryonic stem cell research represents a similar horrible turn. Now, what most commentators will focus on is the very issue of creating human embryonic stem cells and whether the government should finance this. This is an area in which the proponents of both sides of the issue have deeply and sincerely held beliefs which I feel must be respected, though in the end they are fundamentally irreconcilable. But peripheral to that, the case represents a very scary intrusion of lawyers into the research funding process.

One of the claims made by the plaintiffs (in particular, the researcher James Sherley) is that the new guidelines on what embryonic stem cell research can be funded represent a very real cause of harm to those working on adult stem cell research: they will have more competition for research funding. That is certainly true, if we view research funding for stem cells as a zero-sum game (and that is another whole can of balled waxworms I won't deal with here). The danger now is that every possible change in federal (or even private?) funding aims will be an opportunity for litigators to intrude. Wind down project X to fund project Y? LAWSUIT! Either this will deter the necessary ebb and flow of funding, or a far-worse-than-zero-sum game ensues in which money for science instead funds litigation (or the buy-offs of potential suits which are routine in that field).

Can this genie be stuffed back in the bottle? I'm not legally trained enough to know. Perhaps it was inevitable. Perhaps we need Congress to explicitly forbid it (but would that be legal?) -- and what are the chances of that? Has a terrible Rubicon been crossed? I hope I am wrong in thinking it has.

Saturday, August 21, 2010

Varus! Where are my legions (of data)!?!?

Bring up the subject of outsourcing, and many minds will immediately jump to the idea of a company using outside services to more cheaply replace operations formerly conducted in house. But the other side of the topic is what I frequently experience: outsourcing allows me to access technologies and capabilities which I simply could not afford on my own, or at least to try very expensive technologies prior to investing in them. This is very useful, but has its own issues.

I've now gotten data from 4 different large outsourced sequencing projects. Rated on a five star system, they would (in order) be rated less than expected (**), complete failure (*), less than expected (**) and greater than expected (****). Samples for two more projects just shipped out last week. Given that we don't have any sort of sequencer in house (one project above was conventional Sanger) nor can we willy-nilly buy any specialized hardware for target enrichment (two projects involved enrichment), this has been valuable -- though I really wish I could have been able to rate all as greater than expected (or at least one off the charts).

After the quality of the delivered data, my next greatest frustration is with knowing when that data will be delivered. Now, a few projects (plus some explicit vendor tests not included in the above) have gone on schedule, but the utter failure compounded the pain by being grossly overdue (1-3 months, depending on quite how you define the start point), and one of the other projects came in a week overdue.

But even worse than being late is not knowing how late until the data shows up. Partly this revolves around trying to budget my time appropriately, but it also affects setting expectations for others awaiting the results.

In an ideal world, I'd have a real-time portal onto the vendor's LIMS -- one cancer model outfit claimed exactly this. But in any case, I'd really like to have regular updates as to the progress of my project -- and especially to what's happening if the vendor has gone into troubleshooting mode.

After all, what these outfits wish to claim is that they will act as an extension of my organization. Now, if the work were going on in house & I was concerned about progress, I could easily pop in and chat with the person(s) working on it. I'm not interested in hanging over someone's shoulder & making them nervous, but I do like to at least understand what is going on & what approaches are being used to solve any problems. Unfortunately, in several outsourcing projects precisely this was lacking -- no concrete estimate of a schedule, nor any regular communication when projects were overdue.

In a basic sense, I'd like an update every time my project crosses a significant threshold. Now, the exact definition of that is tricky. But imagine a typical hybridization capture targeted sequencing project. The vendor receives my DNA, then shears, size selects, ligates adapters, amplifies and has a library. Some QC happens at various stages. Then there is the hybridization, recovery and further amplification. At some point the platform-specific upstream-of-sequencer step occurs (cluster formation or ePCR). Then it goes on the sequencer. Each cycle of sequencing occurs, plus (for Illumina) cluster regeneration and paired end sequencing. Then downstream basecalling (if not in line). Once basecalls are done, then whatever steps occur to get me the data. And that's the workflow when everything goes correctly: throw in some troubleshooting should problems occur.
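As a thought experiment, the milestone feed I'd want could be trivially simple. Here's a Python sketch -- every milestone name and project label is my own invention, not any vendor's actual LIMS:

```python
# Hypothetical milestone list for a hybridization capture project, and a
# helper that emits one short status line per threshold crossed.
WORKFLOW = [
    "sample received", "library constructed", "library QC passed",
    "hybridization capture done", "clusters/beads generated",
    "on sequencer", "basecalling complete", "data delivered",
]

def milestone_messages(project, completed_steps):
    """Yield one status line per completed milestone, in workflow order."""
    for step in WORKFLOW[:completed_steps]:
        yield "%s: %s" % (project, step)

msgs = list(milestone_messages("RNA-Seq-42", 3))
print("\n".join(msgs))
```

Each line is short enough to push over email, RSS or indeed something Twitter-like; the hard part is the vendor wiring their LIMS to emit the events, not the plumbing.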

Now, ideally I could see all of those steps. But how? I really don't want an email after every sequencer cycle. Could something like Twitter be adapted for this purpose?

Happily, in the recent experience that inspired the title of this post, the data did finally come in (after some hiccups with delivery) and was quite exciting. So I'm not tearing my lab coat like Augustus. But when vendors try to solicit my business, or when I'm rating the experience afterwards, the transparency and granularity of their communication will be a critical consideration. Vendors who are reading this, take note!

Tuesday, August 17, 2010

Life Tech Gobbles Ion Torrent

Tonight's big news is that Life Technologies, the giant formed by the merger of ABI and Invitrogen, has acquired Ion Torrent for an eye-popping $375M (mixed cash & stock), with another $325M possible in milestones and such.

I'm not shocked Ion Torrent was shopping itself; by linking with an established player, Ion Torrent gains access to marketing channels -- an area in which they have displayed a serious handicap. While Ion Torrent was adept at creating buzz with founder Jonathan Rothberg's rock star presentations and their sequencer giveaway contests, the actual marketing infrastructure to follow up on all the leads generated through those efforts was clearly lacking (as in, they have yet to contact me!).

One interesting detail of the press release is the fact that the price point for their sequencer is placed at "below $100K"; Ion Torrent had previously billed their machine at under $50K. Is this a real shift, or does it simply reflect the true cost once sample prep gear is thrown in?

Now, there are several interesting angles to watch. First, how will Life position their full lineup of sequencers, now that they have 3 different technologies (SOLiD, Ion Torrent & VisiGen) with very different performance characteristics? Plus, they had the SOLiD PI in line to be an entry-level second generation sequencer -- how will this affect that?

Another area to watch is how tightly Ion Torrent is tied into the SOLiD line. While the chemistry is very different, there are opportunities. For example, can the EZ Bead emulsion PCR robots be used for Ion Torrent sample prep (with the whole sample prep issue being a big black box for the technology)? Will the same library prep reagents for SOLiD be usable with Ion Torrent? I'd love to see that -- especially if Ion Torrent drives volumes which ultimately drive kit costs down. Of course, the biggest question is when people can actually buy one of the beasts.

Roche/454 seemed like a more obvious partner for Ion Torrent -- very similar chemistries, & a tie-up of that sort might have meant a very rapid extension of Ion Torrent read lengths. Roche should be quite nervous; between Ion Torrent and Pacific Biosciences they are going to be under extreme pressure in long read niches, and their next technology (GE's nanopores) is unlikely to be ready for many years. Ion Torrent could have also been an interesting play for a reagent company looking to jump into sequencing instruments; such a company could have also brought the right sales network into play. A non-bio player could have happened, but I doubt that would have ended well -- Ion Torrent needs to complete their act & get their machine out to biologists.

Saturday, August 07, 2010

Perchance to dream

I had an amusing dream the other night. Nothing earth shattering: neither starved calves consuming fatted ones nor serpentine molecular orbitals. But, an amusing spin on something I had recently discussed with a friend.

In the dream, I've apparently gotten to a presentation late -- and just missed the announcement of the sample preparation upstream of the Ion Torrent instrument. I look to my side & it's a guy from Ion Torrent with all sorts of stuff in front of him, but when I try to ask him what I missed, he indicates silence. And then I wake up.

How exactly one goes from DNA to the instrument still appears to be a mystery. There is certainly nothing on the Ion Torrent website (which is rather focused on flash, not substance) to reveal it. A reasonable assumption is emulsion PCR, but there are other candidates (e.g. rolling circle).

Given that this is a rather important piece of the puzzle, there are several common guesses for why it is a mystery. One is that there are IP issues still to be resolved; in a similar vein, Ion Torrent just licensed some IP from a British company (DNA Electronics), which sounds like a near clone in terms of approach. A second is that they are still working out which approach to support. A third is that it simply isn't flashy enough to be worth mentioning.

Interestingly, Ion Torrent has apparently already sold a machine each to the USGS and NCI, plus there are the ones promised to the grant winners (of which, alas, I am not one -- though a winning proposal was a kissing cousin of mine). And Ion Torrent certainly hasn't started beating the bushes hard for sales.

I'm still very eager to try out the Ion Torrent box. While it won't replace some of the other systems for many applications, the cost profile of Ion Torrent will open up very high throughput sequencing to many more labs. I have a number of ideas of how I might use one rather frequently. Now if only they'd try to sell me one -- and ideally in the real world, and not while I slumber!

Sunday, August 01, 2010

Curse you Larry the CEO!

A bit after getting to my current shop, I requested some serious iron for my work and it was decided I would have a Linux box. The question came up as to which flavor, and after canvassing my networks we went with Ubuntu. I had never administered a Linux system before and had to learn the whole package installation procedure, which is so easy even I could learn it. The "apt" tool works beautifully 99% of the time, not only getting and installing the package of interest but also all its dependencies. The occasional exceptions were cases where either the package of interest didn't seem to be available from a package repository or the Ubuntu repositories were behind the version I needed. But in general, it was nice and painless.

Earlier this year, it was clear I needed an Oracle play space and the obvious place was my machine -- not only is it quite powerful, but then any blow-back from any misdeeds of mine would hit only the perpetrator. However, when our skilled Oracle expert contractor tried to install Oracle, not much luck -- Oracle apparently doesn't support Ubuntu well. So the decision was made to switch to Red Hat.

This did not go cleanly -- the admins were fighting with the reinstall most of the week (the RAID drive had protections on it that did not wish to go quietly), but finally the new system was configured on Friday. So on Saturday night I declared, "Amanda, I know what we are going to do today! Install packages!".

Now, I've actually made a consistent habit here of logging all my installs, so I had a menu of what to try to install. Some quick Googling found some guides to using the different installation tools on Red Hat. So I started trying to install stuff. A few went cleanly, but those were definitely the exceptions -- and the worst part is that R is proving to be a major headache.

The problem is trying to get all the dependencies to install, and R has a heap. The fact that many have "-devel" in the title can't make things easy. Worse, one required package, "tetex-latex", is no longer supported by its creator. Despite configuring multiple repositories and trying to download some packages manually, I have made little headway so far. So from that standpoint, at the moment my system is "Busted!".

Now, I could blame our contractor, but how was he to know this would be so miserable (though the comment by someone at Red Hat support that this is the first time he'd heard of someone going from Ubuntu to Red Hat does give pause!)? I could also take umbrage with the Linux community, which seems to be a hydra of endless subvariants (Ubuntu, Debian, Red Hat, Red Hat Enterprise, CentOS, Fedora, Mandriva -- and I'm sure that's an incomplete list!). But, it's easiest to blame Oracle, who doesn't support Ubuntu, and if I'm going to do that I'll single out the face of Oracle. On the other hand, it's a bit pointless to hold anger over this against Mr. Ellison. He's a CEO; they don't do much.