Friday, February 24, 2012

Why Oxford Nanopore Needs to Release Some Data Pronto (Besides Bailing Me Out)

Last week's piece on Oxford Nanopore got a lot of attention and a lot of comments, which to me is the true mark of success in this space (discounting the higher-than-normal spam attempts).  A couple of folks were kind enough to tweet a link (now captured in Nick Loman's wonderful tweet archives for AGBT 2012), and it was also picked up by Matthew Herper at Forbes, Dan Koboldt at MassGenomics and others (apologies to all I haven't shouted out).  It also can't be denied that some of those commenters felt I had been too generous / gullible with Oxford Nanopore.
It's always a trick writing pieces about new technologies, and I'm aware of a number of factors which can influence my take on a given story.  

An obvious one was Oxford granting me an advance interview with top personnel.  Being treated like a real journalist is one thing, and getting to write in advance is a big help.  Knowing a big secret, even for fewer than 24 hours, can be a bit heady.  There is always a risk that at some level, perhaps unconscious, I'll lean away from offending a company out of misplaced gratitude or worse.

On the other hand, there is a prior that must be assigned to whom I was speaking with.  I've known Ewan Birney for a long time, having been very impressed with his Dynamite paper at ISMB back in 1997.  I even hosted him in my apartment when I was trying to recruit him to Millennium.  Ewan has done a lot of amazing stuff and is pretty damn blunt in his speech; he's someone I'd trust to not be a tout.  I hadn't met Clive Brown before, but he has a reputation as a straight shooter, and perhaps more importantly he and much of the same team launched the Illumina technology; he's someone with a favorable track record.

An even more complicated calculus comes from the performance specs.  Now, as regular readers of this space may have discerned, I believe that a fair evaluation of any sequencing technology is dependent on the application it is going to be put to.  Any given technology will be right for some applications and wrong for others, and so it generally irritates me to see people dismissing a technology because it won't fit their pet application.  But, the flip side of that is I may go easier on something that looks like it might fit my problem du jour.

If I were still back at Infinity, I'd have only a mild interest in Oxford because the claimed error rate is pretty high for reading out clinical mutations, though it must be said that most of those errors are deletions in specific (though unspecified) sequence contexts.  Much as with Ion Torrent's homopolymer issue, this can sometimes not be an issue if you are looking for in-frame missense changes; if you get a frameshift you toss the read.

But, for my current shop Oxford pushes an awful lot of buttons.  A slightly less enigmatic website has gone up for us, which gives me a better idea of what I should and shouldn't reveal.  We're doing a lot of de novo sequencing of small genomes, and with just fragment short reads they just don't assemble well enough.  So, one must make mate pair libraries -- which are DNA hogs, slow, labor intensive and sometimes still don't solve the problem.  So perhaps you go to longer mate pairs, meaning more DNA, more work and more time.  I was going to be thrilled with Oxford even if it offered substantially worse than PacBio accuracy; the idea of getting 50Kb reads of any sort from minimal input is just too intoxicating.  Throw in an apparent indifference to DNA purity and you really have something.

There's also, of course, just getting too drawn in by the technical details.  Each one of these sequencing technologies is pretty amazing, with all sorts of gadgetry bordering on sci-fi.

But on reflection, I wouldn't change much in the piece, other than to emphasize more "if"s and calls for data release.  One person I consulted wondered if Oxford could have potentially overfitted on lambda and phiX, and I wish I had thought of that on my own.  I must say I am glad I did not go with my original title.  I thought long and hard about the title, as I like something that fits and ideally has some word play.  The initial thought was terse (8 printing characters!) and pseudo-alliterative.  But, it used an expression someone my age probably shouldn't attempt and would certainly fuel the feeling I wasn't being objective.  Yes, it was fun to think up but the wrong title would have been: ONT? OMG!

In any case, Oxford really needs to release data pronto to allay the skepticism.  It won't cure it entirely; many will think it has been cherry-picked.  This has been a refrain with each new sequencing technology: in previous years it was PacBio or Ion who were slow to release data (which makes Jonathan Rothberg's complaint about no data a bit, well, interesting).  The only real solution to that is to let some independent labs generate data.  Of course, if you want that data propagated quickly and publicly, you might be wise to let a blogger take a crack at it...

There's a second reason Oxford should think hard about getting data out soon, no matter how messy it is.  I believe that bioinformatics makes a difference.  Support by the academic software community is a critical asset for any sequencing technology.  This isn't to say that commercial tools aren't important, but in the end they offer only a narrow spectrum of capabilities.

Now, many tools are relatively platform-agnostic.  However, there are a lot that are not, sometimes in subtle ways and sometimes not so much.  This was very clearly true with SOLiD, which had a completely different data format (colorspace) which required special handling.  Tools supporting colorspace were slower to appear than Illumina-oriented tools, and it can't have helped SOLiD.  

Specialized error modes are an even bigger problem.  For example, Illumina reads have a low indel rate and are therefore very suitable for de Bruijn graph strategies for sequence assembly.  Contrast that with 454 and Ion, with their frequent indels.  There is a huge flock of academic (and mostly open source) assemblers tested and tuned on Illumina data; only a handful for 454 and Ion.  Worse, what many claim to be the best assembler for Ion data (Newbler) is controlled by 454, and therefore not available to most Ion users.
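To make the de Bruijn point a bit more concrete, here is a toy Python sketch; it is not how any real assembler works. De Bruijn assemblers build their graph from exact k-mers, and every uncorrected error in a read (indel or otherwise) spoils up to k of that read's k-mers, so a platform with frequent indels gives such tools much less to work with. The sequence, the value of k and the function names below are all made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(reference, read, k):
    """Fraction of the read's distinct k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & kmers(reference, k)) / len(read_kmers)


# Hypothetical toy reference and an error-free read drawn from it.
reference = "ACGTTGCATGGCTAAGCTTCGATCCGATAGGCTA"
clean_read = reference[5:30]

# Simulate a single-base deletion (here, one T dropped from a "TT" run).
indel_read = clean_read[:12] + clean_read[13:]

k = 7
print("clean read:", shared_fraction(reference, clean_read, k))
print("read with one deletion:", shared_fraction(reference, indel_read, k))
```

The clean read keeps all of its k-mers, while the read with one missing base loses every k-mer that spans the deletion. Multiply that by an indel every few dozen bases and it is easy to see why tools tuned on Illumina data need real adaptation before they cope with other error profiles.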

I believe this story has repeated itself with the other two existing platforms, Helicos and PacBio.  Both have tried to have semi-open user communities (as has Ion), but neither has much specific software support.  I believe there is a virtuous circle which few platforms have succeeded in: the availability of a wide variety of tools drives more use of the platform, which in turn drives the generation of more tools.  Diversity of tools for a given application can be confusing, but can also lead to improvement (competition is a good thing!).  Diversity of applications supported is even more important, as someone's minor niche may grow into a major application.

So, if Oxford wants to get this virtuous circle rolling, they need to start feeding it data ASAP.  A variety of data, from a variety of organisms.  ONT data is likely to fit poorly into most tools; they won't be prepared for the read lengths and the idiosyncratic error profile will require adaptation.  But, the ONT long reads will spur novel applications, which will need novel tools to support them.  Indeed, I generally feel that a technology like Oxford's can succeed primarily by generating completely new applications that are well-suited to its idiosyncrasies; applications enabled by Oxford's strengths and tolerant of its weaknesses.

A last thought: for any company considering doing this, please learn from the laggards.  Ion, PacBio, SOLiD and Helicos all tried to create custom environments for data release and software developers: areas requiring registration and learning new passageways.  Your team might think those systems are intuitive and simple, but they are invariably a pain and off-putting (try using wget to pull data from IonTorrent's site!).  Do yourself a favor: build a registration-free data release site and just use SEQAnswers as the discussion forum.  Your marketing people will grouse you've missed an opportunity to collect data, but just ignore them.  You will have something far more precious: lots of ADHDish programmers trying to play with your platform's data.

6 comments:

Shawn Baker said...

Hey, don't just blame the Marketing folks. I and most of my marketing colleagues at Illumina pushed for very open systems. Resistance almost always came from outside our department.

cariaso said...

Name names, or STFU & GTFO.

You've got weasel words 'most' & 'almost' in both of your sentences. Marketing has earned its distrust.

Adam Smith said...

Posted a feature on this topic yesterday - ONT are remaining very tight lipped!

http://www.elements-science.co.uk/2012/02/disrupt-scientists-probe-bold-gene-sequencing-startup/

Nick Loman said...

I think Oxford Nanopore are wise not to release data at this point. Because my reading of the MinION (particularly) as a game-changer is to do with three things: the disposable device, amplification & library prep-free single molecule detection and super long reads. Run-until is also a nice feature. I think the people involved (Clive Brown predominantly) have sufficient reputation to trust this isn't pure vapourware - other start-ups don't have the same pedigree.

So my feeling is that if they put out some 96% accuracy dataset immediately this becomes a like for like comparison with other platforms which isn't telling the whole story.

Keith, I had similar issues when constructing my piece but am very comfortable with it.

Of course I am always prepared to eat my words later on, I've done it before!! :)

Shawn Baker said...

Cariaso, “STFU & GTFO”?! Wow, that’s a lot of hostility aimed at a lighthearted defense of marketing. You want me to “name names”? I’ve posted under my real name and named my former employer. What more do you want? Sr. staff and the legal team were the ones who generally were more ‘conservative’ when it came to sharing information.

I can give you two examples. One success was when I was able to create the site www.switchtoi.com, which provided detailed info about our gene expression arrays with no login needed. (The site is still active, but probably hasn’t been updated in a while). One failure was when a customer asked for the ‘zipcode’ sequences used on our beadarray products. Legal ended up denying the request (not sure why, probably just erring on the side of ‘caution’.)

Yes, I said ‘most’ and ‘almost’ because that’s an accurate portrayal of the situation. Did some people in marketing at one time or another push back against ‘open information’? Sure, of course. But the vast majority of the time, marketing (especially product marketing) was in favor of keeping things open.

It’s unfortunate that marketing has earned your distrust (but maybe not surprising; examples of bad behavior abound). You should find the good marketers and befriend them – the good ones can be extremely helpful.

In terms of this blog, I agree with both Keith and Nick: ONT should release data to the public as quickly as possible to help start the ball rolling on the creation of open source tools (but also that ONT may be reluctant if they feel they can substantially improve the quality over the next few months).

GroovyGeek said...

Legal... oh how I "love" legal. The people who will "negotiate" for months with one another without ever staking out a firm opening position. Or who will argue incessantly about placement of trivial words in a text. The job of a good lawyer is basically described as "find a tiny technicality ANYWHERE and then exploit it beyond any reason". Yes, I have been on both the winning and losing end of this idiocy, and I would gladly live without it.