Omics! Omics!: May 2011

Thursday, May 26, 2011

Paying a Painful 75% Secrecy Tax

In a post a while back, I mentioned that my Ion Torrent sequencing project was stalled because my service provider couldn't get some of the key kits, despite an Ion representative posting that no such shortages existed. I've been remiss in updating that; last Tuesday the kits showed up and Monday I got my data -- and a bit of a shock.

Monday, May 23, 2011

MiSeq's First Light

Someone was kind enough to send me a copy of a poster by Illumina reporting results from the MiSeq. Now, to be very upfront, by someone I mean "a person from a PR firm contracted by Illumina" and by "kind enough" I mean that she was doing her job. I don't have illusions about motive here; the author list is all from Illumina or Epicentre, but this was a poster presented at the recent Cold Spring Harbor Biology of Genomes meeting. It certainly isn't peer-reviewed data, but it is something. Of course, we can't know to what degree these are cherry-picked results. If you want to be really cynical, call it messaging and not data. Yes, I've taken some flak in the comments recently about how favorable my coverage of Ion has been, and I'm trying to adjust. And don't worry; I have some new bones to pick in that area.

Thursday, May 19, 2011

Forums: Open Beats Closed Hands Down

Around the Internet, there are a number of communities in which scientists can swap useful information. SEQAnswers is a very useful site I frequent; BioStar is one I don't but probably should. Life Technologies has set up a community around Ion Torrent, and the contrast between that and SEQAnswers is a useful one.

SEQAnswers has a straightforward access policy. Anyone can view content, but to post or reply you must register for a membership. This approach appears to have been very successful, as there is a healthy number of individuals posting to the site. You can browse around and figure out if the site applies, plus search engines such as Google can steer individuals in. SEQAnswers boasts a number of authors of major second generation sequencing analysis packages, including Bowtie, Tophat, BFAST and Samtools, as regular contributors. There is a significant network benefit to this; quality encourages quality and conversely such folks must be judicious in the number of forums they actively participate in. The management of SEQAnswers applies a light hand, occasionally moving posts to more relevant forums and smothering all spam. In addition to the forums, a key asset is a large wiki on second generation sequencing packages.

The Ion Community is set up on a very different basis. It has two sections, each with its own membership restrictions. PGM Users is open only to registered owners of the sequencing system; Torrent Dev is open to that group plus anyone registered for the Grand Challenges. Each section has both discussion areas and documents. The site is flashy, though more than a few links are indirect detours to what you really want.

Now, there are plenty of examples of how maintaining some control over a site can be productive. I was recently trying to eradicate one of these nefarious fake antivirus viruses from our home computer, and on one major security software company's forums I found what looked suspiciously like a link to infect with one of these viruses. Keeping a single point of origin for documents can be useful as well, to reduce confusion. For example, if you Google around for information on 454 fusion amplicon design it is easy to find outdated information. We also wouldn't want any forum to devolve to the level of the Biotech Rumor Mill, which by its very nature must allow unregistered posters, and as a result is a mudpit of insults and near(?)-libel.

But, Ion's approach is in my opinion strongly self-defeating. I won't go into detail here, but in the extreme form they have on two occasions (this thread and this one) argued for the suppression of information on PGM from SEQAnswers (I will try to tackle this soon, but after getting a chance to talk to at least one person with the Ion side of the story). But it's also easy to argue from a purely practical standpoint, rather than a philosophical one, that this approach is not doing their platform any favors.

The first problem is that the closed access means that you can't lurk on the cheap there; either commit to being part of the community or stay out. This prevents sucking people in slowly; for many the barrier of registering -- particularly since your registration does not become instantly active -- is too high. Indeed, I would offer as evidence for this that the first major PGM-tuned software package not from Ion or a commercial partner was apparently not spurred by the Ion Community, but by Nick Loman's post on assembly of Ion data. Nick's original post apparently received over 2K views, which must be at least an order of magnitude larger than the current Ion Community membership.

The second problem is the two layer design. Okay, I'll admit it -- it's maddening to be excluded from the PGM Users forum. How do I know that there are not discussions there I could either benefit from or contribute to? If someone starts a discussion there better suited for Torrent Dev, will it get bumped up? If so, who decides? But, worse than that is that many technical documents that I would find valuable are beyond that access barrier.

The third problem is that I find these rigid definitions completely at odds with scientific reality. Just because I don't own a PGM doesn't mean I might not try to do everything but run the instrument. As an example on another platform, my colleagues & I have run SureSelect on Illumina where we did all the steps except shear the DNA (outsourced), prep the flowcell and run the flowcell. Furthermore, in a modern collaborative environment, someone in one lab may own the machine while someone in another lab works up the samples. It's also not clear what any of this "security by obscurity" is buying; given the number of groups using Ion, whoever they're trying to hide the information from can certainly find a leak. Ion should heed the example of the music industry, which failed to provide legitimate means to supply a growing demand for digital music, and thereby spawned widespread illegal file sharing. Plus, many browsers are potential buyers -- people want to know what they are really buying in to and to start storyboarding what running an instrument would mean in terms of personnel and auxiliary equipment.

The fourth problem is discoverability. How do I find information? Well, for second generation sequencing stuff it is the trio of PubMed, Google and the SEQAnswers software wiki. PubMed is great, but many packages show up online long before they are published. Google is useful if you know what you are looking for, and SEQAnswers is great if someone has logged it there (and in general, once I see something I make sure it is logged there).

But with the Ion Community, those last two are problematic. The objections Ion has raised to links going to the interior of the community mean that I don't dare put such links in the Wiki; but conversely a link just pointing to the community is not terribly useful. Worst of all, since Ion apparently won't let Google in, their stuff is invisible to that valuable tool. As evidence, at the moment if you Google for information on their TMAP aligner with "tmap source code ion torrent", nothing from the community comes up (but threads on SEQAnswers do!).

Ion isn't the first, and probably won't be the last, company to try to have its proprietary control cake yet eat its Internet openness. Life has SOLiD Community, Helicos has one for Helioscope, PacBio DevNet for SMRT sequencing and so forth. It takes some real courage to relax some control and invite the whole world to your party. Some folks would no doubt see view-without-registration as a loss of useful marketing data, forgetting that the openness of a site will itself draw in customers. Anyone launching a new platform should really ask themselves: will I be better off trying to control a rare destination for a few visitors, or perhaps just cultivating a sub-community over at SEQAnswers?

Saturday, May 14, 2011

Oh Would It Be Fun to Debug Strobe Sequencing!

Through the course of a month, I easily sketch out (on average) a dozen plus experimental designs in the context of my day job. Of course, only a very few of these are ever executed; many never get shown to another soul. By constantly pondering how I might tackle questions, I can keep in practice and present plans which are reasonably well thought-out. It also helps to have thought through things; sometimes a rejected plan will suddenly look like a gem due to some other result or change in priorities.

Atop that, I also sketch up a few which have nothing to do with my day job. Partly it is fun, and partly it is a way to exercise the process on areas I can be certain of my objectivity. A related exercise is to sketch out a business plan; I end up doing that a few times a year. Never have taken any further than a napkin; not only do I lack the necessary thirst for risk, but most are for businesses I'd be happier being a customer than an employee.

For example, after a recent lunch with a friend from graduate school I found myself again contemplating a question that had arisen at Codon in the context of gene synthesis. We didn't have second generation sequencing there (if Codon had survived, we certainly would have launched into it at some point) and had seen the hint of an interesting phenomenon (and practical problem) but could never have collected enough data to really nail it down. Now, with sequencing cheap on a large scale, and this being the perfect sort of problem for such sequencing (since one would need only very short reads), it was a short-term obsession to work out how to run such an experiment. For probably $5K and a tiny bit of molecular biology, a nice little paper -- pity I don't have a slush fund to cover it. My apologies for not supplying any details; maybe I will someday have some found money to cover the project.

However, I did just get another idea that I might as well be open about: for one, it will probably be solved in the near future, and two, I have no opportunity to actually work on it. So it can be fun to speculate in the open on how to address the problem.

Friday, May 13, 2011

Blogger glitch

Blogger experienced an outage which I found out about today. This caused my last post to disappear; any posts or comments in a certain time period were lost.

Thanks go out to the anonymous reader who posted a comment wondering where my last post went. Blogger (Google) did not notify me of the issue. The really disconcerting part is the Google cached version of the post was missing as well -- this suggested something very odd going on.

I often write posts directly in Blogger, so I don't have a backup. This was an exception & I did actually have the rough draft, which I would have posted if the old one wasn't restored.

Wednesday, May 11, 2011

Ion's Growing Pains

A recent In Sequence article indicated that Ion Torrent is enjoying strong initial sales. This bodes well for continued evolution and improvement of the technology, as LIFE will continue to smell revenues and opportunity. Ion has announced a number of improvements, but most aren't scheduled to arrive until the near future.

The challenge is for LIFE to keep executing on their plan. Already some issues have arisen; my own Ion experiment at a service provider is stalled due to a back-ordered template prep reagent (two weeks and counting!). This is a key reason to do pilots on emerging technologies with non-critical (but interesting) samples; I was bitten last fall by another backorder bug (that time, the paired end SOLiD reagents). Of course, this time it is even more complicated, as Ion is making a major change to both the underlying kits and to the software to process the data. It will be worth it if I get results anything like Ion's provided E.coli 314 dataset, which has about 4.5X the data of the original 314 chip spec.

Sunday, May 08, 2011

Ion Torrent's Data Quality Is Pretty Good (and Better Than Ion Claims)

One of the key questions around Ion Torrent, as with all new platforms, is what is the sequence data quality like. Now, that can be a loaded question but I'll ask a slight variant on it: how truthful (or accurate) is Ion at estimating base quality?

Quality scores are a useful adjunct to sequencing data and are commonly expressed as phred scores, which are the integer part of -10*log10 of the error probability. Any base caller needs to estimate these and many downstream programs, from aligners to assemblers to variant callers, rely on these quality values for their operations. In many cases, the individual quality scores are combined to generate some joint estimate of the error (I built one such model at Codon). These error probabilities come not from an infallible source, but are rather estimated from aspects of the raw data.
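To make the arithmetic concrete, here is a minimal Python sketch of the phred conversion described above, plus one very simple way of combining per-base scores into a joint estimate. The function names are my own inventions for illustration, and the combining step assumes errors at different bases are independent; it is not the model I built at Codon nor anything Ion's base caller is known to do.

```python
import math


def phred_from_error_prob(p):
    """Phred score: the integer part of -10*log10(error probability)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    # For p <= 1 the value is non-negative, so int() truncation
    # is the same as taking the integer part.
    return int(-10.0 * math.log10(p))


def error_prob_from_phred(q):
    """Invert the phred scale: Q20 -> 0.01, Q30 -> 0.001, and so on."""
    return 10.0 ** (-q / 10.0)


def expected_errors(quals):
    """Expected number of wrong bases in a read: the sum of the
    per-base error probabilities."""
    return sum(error_prob_from_phred(q) for q in quals)


def prob_error_free(quals):
    """Probability that every base in the read is correct.
    ASSUMPTION: errors at different positions are independent."""
    prob = 1.0
    for q in quals:
        prob *= 1.0 - error_prob_from_phred(q)
    return prob


if __name__ == "__main__":
    read_quals = [20, 20, 30]  # made-up scores, not real Ion data
    print(expected_errors(read_quals))
    print(prob_error_free(read_quals))
```

Asking how truthful a platform is about quality then amounts to comparing these predicted error probabilities with the error rates actually observed at bases carrying each score.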

Saturday, May 07, 2011

Which numbers did I use this time?

Noted screenwriter William Goldman's second memoir on the film industry is titled "Which Lie Did I Tell?". The title is not a quote from Goldman, but rather what another movie industry figure said after getting off a long phone call he had taken in Goldman's presence. I'm a bit nervous I inadvertently strayed in that direction in my last item on Pacific Biosciences.

I was relying on memory for my numbers & in the course of writing things I think I also revised those downwards, not out of malice but rather an attempt to be conservative. As a correspondent pointed out, the first pass accuracy on the first commercial system is claimed to be 85%, not 80% as I stated. On the number of reads, I failed to update for the newer SMRT cells which are out; I said 10K and it's probably at least 3X that. However, I do have a bone to pick.

Monday, May 02, 2011

How Many More Machines will PacBio Sell This Year?

Amongst last week's news is the item that Pacific Biosciences has officially launched their SMRT sequencing platform. I've been too eye-deep in various projects to figure out how off schedule that is (I think nearly a year from their original target), but now it is launched.