Thursday, March 31, 2016

Reflections on And The Band Played On

Fellow blogger, colleague and science history buff Ash pointed out to me recently that Randy Shilts' And The Band Played On for Kindle was on sale.  I hadn't read the book, nor seen the miniseries, so I snapped up a copy.  It's a good read -- though at times a hard one -- I don't believe I've ever read another work of non-fiction where such a high fraction of the named individuals are dead by the end of the book.

Wednesday, March 30, 2016

Who Wants To Write A Review Article?

Yes, this is a solicitation.  I'm on the Editorial Board of the journal Briefings in Bioinformatics.  I'm looking for authors who would like to write high-quality, compact reviews.  If you are interested, or you want a little back-story, then keep reading.

Tuesday, March 29, 2016

At the Edge of The Cloud

I've used cloud computing at Amazon Web Services (AWS) off-and-on now for over five years.  The cloud has all sorts of handy advantages -- flexible access to large amounts of compute, inexpensive access to any flavor of Linux you wish, the ability to guiltlessly kill a huge server you just fatally cratered with the wrong command.  And until now, I've always been able to find machines that fit my needs -- perhaps sometimes just fitting or with a bit of compromise.  But now I've hit the wall: nobody at this time offers a really serious cloud machine with 500GB of RAM.

Friday, March 25, 2016

Selective sequencing: A Programming Opportunity!

I ask a bit of indulgence from my regular readership for this piece, as I am going to explain a number of things in depth that probably will be very familiar to them.  My hope, perhaps fantastic, is that this piece will get out to some who are not so familiar with such topics, as I think the problem at hand might be very fascinating.

Friday, March 18, 2016

PacBio's big splash

[18 March 2016 -- my original inclusion of the PacBio marketing image 6 years ago was claimed to be a DMCA violation -- I've simply removed it, though I do think this would fall under fair use]

The Pacific Biosciences instrument is officially unveiled now, with those lucky/smart (or SMRT?) enough to go to Marco Island filling in all of us not in that position. Sounds like a great lot of hoopla, though they didn't drag the Hornet for the splashdown.

First of all, it's a beast. "In this corner, weighing in at nearly an imperial ton...". Too bad their marketing picture has nothing good for judging the scale -- it's apparently 6.5 feet wide.

Kevin Davies at Bio-IT World has a wonderfully detailed article and there are a lot of nuggets in the Twitter feed. Anthony Fejes has two different sets of notes out -- one from a workshop and one from another speaker; Dan Koboldt has some good notes too (and if I haven't shouted out your notes, it's probably because I'm oblivious -- leave me a comment pointing to them). There was also a little bit of PacBio science in Elaine Mardis' talk (she's on their SAB) -- see Anthony's notes & the Twitter feed.

Okay, besides worrying about the capacity of floors & freight elevators, what's new? Well, not much on error rates from PacBio (apparently in the Q&A their presenter executed a jig, tango, waltz & rumba when asked) -- though the Mardis talk described using PacBio to resequence samples that had previously been sequenced on Illumina -- and the results are quite good. Another important note is that their system doesn't seem to have much bias in terms of composition -- bias against hi/lo %GC has been noted in all of the amplification-based systems and can be a serious problem.

There's also a lot of talk about being able to distinguish various modified bases by their effects on polymerase kinetics. PacBio has also demonstrated direct RNA sequencing (substituting a reverse transcriptase for DNA polymerase) and is talking about watching proteins being made. I haven't quite figured out why you'd want to do that last one, but presumably it's for more than a cool Nature cover.

Read lengths decay exponentially -- but with lots around 1Kb and quite a few around 5Kb. The big problem is apparently oxidative damage to the polymerase triggered by the laser -- so they are working on both getting the oxygen out of the system and engineering hardier polymerases (the sort of biz I used to be in). Their strobe sequencing mode -- in which the laser is turned off to enable elongation in safe darkness -- enables multiple reads separated by long gaps.

The instrument definitely raises the bar on sample prep -- it's apparently entirely automated within the monster. YEAH! A machine I can delude myself into thinking I could run! One drawer takes the SMRT cells and another the DNA samples -- 500 ng of each. That doesn't sound like much (it's at least better than the 5-10ug most library prep protocols call for -- except the ones looking for 20-30ug), but it seems you don't get a lot from each sample.

The number of reads per cell isn't huge -- but you're still getting about 2 E. coli genome equivalents by my calculation. This is a bit undersized for a lot of applications -- but grand for many others. Mardis' talk discussed using PacBio for sequencing PCR amplified resequencing samples -- this would appear to be right in the PacBio sweet spot. Perhaps a few hundred long PCR products could be packed into one SMRT run and still get many hundreds of reads per sample -- well, maybe pack fewer amplicons.
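For anyone who wants to play with the amplicon-packing arithmetic, here is a rough sketch in Python. The reads-per-cell and mean read length values are placeholders I picked purely for illustration (they are not PacBio specifications); the E. coli genome size is the usual approximate figure of 4.6 Mb, and reads are assumed to split evenly across amplicons, which is optimistic.

```python
E_COLI_GENOME_BP = 4_600_000  # approximate E. coli genome size


def genome_equivalents(n_reads, mean_read_bp, genome_bp=E_COLI_GENOME_BP):
    """Total sequenced bases expressed as multiples of a genome size."""
    return n_reads * mean_read_bp / genome_bp


def reads_per_amplicon(n_reads, n_amplicons):
    """Average reads per amplicon, assuming reads divide evenly (optimistic)."""
    return n_reads // n_amplicons


# Hypothetical per-cell values, chosen only for illustration.
reads_per_cell = 20_000
mean_read_bp = 500

print(f"genome equivalents: {genome_equivalents(reads_per_cell, mean_read_bp):.2f}")
for n_amplicons in (300, 100, 30):
    per_amplicon = reads_per_amplicon(reads_per_cell, n_amplicons)
    print(f"{n_amplicons} amplicons -> about {per_amplicon} reads each")
```

Under those made-up numbers, a few hundred amplicons per cell leaves well under a hundred reads apiece, which is why packing fewer amplicons looks like the safer bet.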

What might be other good uses? Clearly metagenomics and similar. I just saw a posting on a professional board of someone pondering multiplexing hundreds of samples for an Illumina run (the current barcode schemes are for a few orders of magnitude fewer samples). Blitzing each sample through the PacBio instrument would seem to be obvious -- if the error rates are acceptable. Folks doing whole genome sequencing of small genomes will love having PacBio to generate scaffolds. For bigger genomes, it may just still be too expensive to get much coverage ($100 a SMRT cell sounds cheap, until you start multiplying that out for the numbers you need) -- but perhaps not (much too fried to do that calculation at the moment).
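The calculation I was too fried to do is simple enough to sketch. This version takes the two figures from this post at face value ($100 per SMRT cell and roughly 2 E. coli genome equivalents of sequence per cell) and adds assumptions of my own: an E. coli genome of about 4.6 Mb, a human genome of about 3.1 Gb, and perfectly even coverage. Treat the output as order-of-magnitude only.

```python
E_COLI_BP = 4_600_000            # approximate E. coli genome size (assumption)
HUMAN_BP = 3_100_000_000         # approximate human genome size (assumption)
BASES_PER_CELL = 2 * E_COLI_BP   # the post's rough estimate of yield per SMRT cell
COST_PER_CELL_USD = 100          # price per SMRT cell quoted in the post


def cells_for_coverage(genome_bp, coverage, bases_per_cell=BASES_PER_CELL):
    """SMRT cells needed for a given fold coverage, rounded up."""
    bases_needed = genome_bp * coverage
    return (bases_needed + bases_per_cell - 1) // bases_per_cell


for name, genome_bp in (("E. coli", E_COLI_BP), ("human", HUMAN_BP)):
    for coverage in (10, 30):
        cells = cells_for_coverage(genome_bp, coverage)
        cost = cells * COST_PER_CELL_USD
        print(f"{name} at {coverage}X: {cells} cells, about ${cost:,}")
```

On these assumptions a bacterial genome needs a handful of cells, while a human genome at even modest coverage runs into the thousands.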

RNA-Seq might be a bit trickier. If you need 500ng of input material, that's an awful lot of ribosome-depleted or poly-A RNA. Plus, you're getting only tens of thousands of reads, which makes it hard to see lowly-expressed messages -- though the very long reads are perhaps priceless. But, if you can get tons of RNA, then 100 SMRT cells would be about $10K and offer similar depth to what you can get today with Illumina but with those super long reads.

Now, who is this going to crimp the most? The instrument is clearly a ways from really threatening Illumina & SOLiD for the large genome market. 454 is a likely candidate to see growth pressured -- though between the new lower-priced "junior" and both PacBio's $700K price tag and their inability to flood the market with instruments, this will be ameliorated.

PacBio might have almost as much effect on the surrounding sequencing ecosystem. Making library prep reagents for this system is not going to make you lots of money! But, there will be a serious niche for targeted sequencing -- though with the scale it will probably require some rethought. Stuffing the whole exome into this doesn't really make sense -- if there are ~250K segments of the genome to read & you want 40X coverage of each, that's a lot of SMRT cells. But, intelligently chosen gene sets totaling about 500 regions (or around 20-50 genes) with pre-validated reagents -- now that might be a market (though one which might have 1-2 years of life -- better get cracking!). Simpler library prep will also go nicely with some of the enrichment systems -- a bugaboo of hybridization systems can be "daisy-chaining" of fragments via the amplification adapters -- but, on the other hand you don't get 500ng off an array or in-solution system without amplification. As with many disruptive technologies, it won't fit a lot of bills but will nibble off various parts of the business that are individually small but significant in aggregate. As noted above, RNA-Seq might be an initial success story for PacBio -- when RNA is abundant.
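To put rough numbers on the exome-versus-panel comparison, here is a back-of-the-envelope sketch. It uses the ~250K segments, 40X coverage and $100 per SMRT cell mentioned in this post; the reads-per-cell value is a placeholder of mine, and the model naively assumes each read covers exactly one targeted segment with perfectly uniform capture and no off-target reads.

```python
import math

COST_PER_CELL_USD = 100   # price per SMRT cell quoted in the post
READS_PER_CELL = 20_000   # hypothetical yield, for illustration only


def cells_needed(n_regions, coverage, reads_per_cell=READS_PER_CELL):
    """SMRT cells needed if every read covers exactly one region."""
    reads_required = n_regions * coverage
    return math.ceil(reads_required / reads_per_cell)


for label, n_regions in (("whole exome", 250_000), ("targeted panel", 500)):
    cells = cells_needed(n_regions, coverage=40)
    print(f"{label}: {cells} cells, about ${cells * COST_PER_CELL_USD:,}")
```

With that placeholder yield, a 500-region panel at 40X fits in a single cell, while the exome needs hundreds of them.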

IMHO, PacBio does need to get some papers out on applications (Mardis' group apparently is close to having one) and make sure that the next tranche of installations not only includes the Sanger & BGI, but that there are also some core labs or commercial providers. Also, they need to start pumping data into the public domain -- while they signed a bunch of commercial software providers up, it is definitely out of academia that you find the most radical advances. Plus, there are a lot of now well-entrenched open source tools that need to be tested with the new kid. Even simple things like the semi-standard SAM/BAM format are going to need tweaking -- SAM/BAM stores all sorts of information on read pairs, and the strobe sequencing can generate many more than 2 tags per DNA fragment.
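As a small illustration of the read-pair point, here is a sketch of how a tool might interpret the segment bits of the SAM FLAG field. The bit values (0x1, 0x40, 0x80) are the ones defined in the current SAM specification, which words them in terms of template segments; the function name and the labels it returns are my own invention.

```python
FLAG_MULTI_SEGMENT = 0x1   # template has multiple segments
FLAG_FIRST = 0x40          # first segment in the template
FLAG_LAST = 0x80           # last segment in the template


def segment_position(flag):
    """Classify a read by the segment bits of its SAM FLAG (labels are illustrative)."""
    if not flag & FLAG_MULTI_SEGMENT:
        return "single"
    first = bool(flag & FLAG_FIRST)
    last = bool(flag & FLAG_LAST)
    if first and last:
        return "middle"    # neither first nor last segment of the template
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"       # segment index not recorded


for flag in (0x0, 0x41, 0xC1, 0x81):
    print(hex(flag), segment_position(flag))
```

Note that the flag alone cannot say which middle segment a read is, so a strobe run with many sub-reads per fragment would need its ordering carried somewhere else, and any tool that assumes exactly two reads per template would need rework.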

Of course, we have to wait another half day plus to find out what Ion Torrent is really delivering. That could really shake up the landscape -- at least the mental one.

A huge thanks to all the bloggers & twitterers for pouring out so much information. I'm still getting used to scanning past the retweets (is there a way to condense them?) and there is the occasional shock-to-the-system (how could anyone in the field not have heard of Rodger Staden?!?), but that's a tiny price to pay for such fascinating stuff.


Monday, March 14, 2016

A Mosquito ExAC?

Okay, a scheme for a crazy big genomics project has bitten me, infecting my brain.  It's definitely not something I'm in a position at all to execute on, but I throw it out as an idea in case anyone finds it useful.  And admittedly, it is pretty much stealing straight from the ExAC human exome aggregation project, which contains huge numbers of human exomes.  Behind all those is a lot of phenotype data.  Now, inspired by recently re-reading Laurie Garrett's The Coming Plague and also faced with daily news items on the Zika virus epidemic, I've had this question: what if the same approach were applied to key disease vectors?

Wednesday, March 09, 2016

Oxford's Riposte To Illumina Trade Action

Along with the "No thanks, I've already got one" online session, the other big Oxford Nanopore news is the public release of Oxford's response to the trade complaint filed by Illumina which was attempting to exclude all Oxford Nanopore devices from the U.S. markets.  Nature News' Erika Check Hayden has posted the document on Dropbox, which was a big help.  While no documents from Oxford's side concerning the simultaneous patent lawsuit have yet surfaced, it is reasonable to expect that it will use many of the same arguments.

Tuesday, March 08, 2016

Oxford's "No thanks, I've already got one"

Oxford Nanopore today hosted a Google hangout titled "No thanks, I've already got one".  Only this morning did it occur to me I could have re-watched Monty Python and the Holy Grail and scored it as blogging-related time! Oxford CTO Clive Brown went through a number of interesting (and in many cases, long-awaited) announcements on the release of multiple key upgrades to the platform (note: unless otherwise specified, images swiped from ONT).

Thursday, February 25, 2016

Digging into the Illumina Lawsuit vs. Oxford Nanopore

Illumina's and University of Washington's filing of a patent lawsuit and related trade complaint against Oxford Nanopore made big news yesterday, with nice coverage from Mick Watson, GenomeWeb, Nature's Erika Check Hayden, Technology Review's Antonio Regalado, BioIT World's Aaron Krol, and venture capitalist Vishal Gulati. Each of these covers the onetime partnership between the two companies and their acrimonious parting of ways.  Oxford Nanopore released a short and pithy response. Since I failed to get an early jump on things, the ground is already well plowed.  So my sloth and inertia have forced me to take an unpleasant route I usually spend great effort avoiding: actually reading the complaints and the two key patents they cite, licensed from Jens Gundlach's group at University of Washington (US8673550 and US9170230).

Wednesday, February 24, 2016

Amplification-free, library-free sequencing? NanoString wants to be It


Perhaps the most unusual new technology to be unveiled at AGBT16 is NanoString's new approach to sequencing, which is in very early stages of development.  Called Hyb And Seq, the process is remarkable in being a purely hybridization-based single molecule method -- absolutely no enzymes are harmed during the operation of the system. That's remarkable -- the only enzyme-free (or nearly so) sequencing approaches to deliver serious amounts of data into GenBank are Maxam and Gilbert approaches (including Church's genomic sequencing and multiplex sequencing), and even those typically required restriction digestion of the target.

Saturday, February 20, 2016

AGBT16 Storify Completion & Rate Limits


AGBT16 ended a week ago, but for various reasons I'm just now catching up on my Storify project.  A vacation was in there but also some tool building.  As I was griping about the pains of organizing the tweets manually, Brian Krueger suggested what was already dawning on me (but it helps to be poked -- professional embarrassment is often a stronger motivator than pure annoyance) -- I needed to stop doing this purely manually.  So, off to deal with pulling in Tweets automatically and at least doing some organization programmatically.

Friday, February 12, 2016

10X Launches Chromium (#agbt16)

10X Genomics launched their approach to obtaining long-range genomic information last year with a big financing and some exciting preliminary data at AGBT15.  Now they are back at AGBT16 with an upgraded instrument, improved biochemistry, new software and new applications, along with a trio of major co-marketing agreements and a splashy hire and a raft of both published and unpublished data from academic collaborators.

#AGBT16 Day 2: How is AGBT On Twitter Like Sequence Assembly?

I spent a bunch of time yesterday going through the Tweets from AGBT.  For me personally it is a useful exercise, plus I'll have it as a resource to go back to for future posts.  But the time and pain involved definitely had me sometimes questioning the wisdom of attempting this.

Wednesday, February 10, 2016

AGBT Begins (with bonus Storify Jeremiad)

Just finished my last Storify for tonight from AGBT16, and boy am I wondering how sustainable this will be.  The "problem", which is wonderful to have, is that the number of tweeters has grown substantially, and so there is a wealth of material to attempt to distill down. There's also the desire to make sure I don't further propagate the spam which sneakily re-tweets the occasional item. The other problem has been my tools.

AGBT16 Preview (aka The Non-Attendee's Lament)

AGBT16 starts today but I'm again not there. The usual complex set of personal constraints (or imagined ones) kept my hat out of the ring this year, and now I'm again torn between wanting to be there and why it would have been hard.  Easy would be leaving our most recent snow and ice storm and the general cold weather.  A bit harder is that it is early in the school term, and back-to-school night is Thursday -- plus I spent last night chatting with a candidate for a local office (School Committee) at a low-key campaign event.  The big, and unforeseeable, challenge is that the other half of Starfleet's bioinformatics group is out on paternity leave, and while I'm proud of how much quotidian work I get done during conferences, it still isn't the same as being on full duty.

Tuesday, February 09, 2016

Why do we purify DNA the way we do?

An interesting conversation on Twitter on means for purifying DNA for PacBio and the risks of phenol-chloroform extractions restarted some pondering on the historical contingency of experimental techniques.  Or, as the title says, why do we (today) purify genomic DNA the way we do?

Thursday, January 28, 2016

First in an occasional series on high school biology: Complexes

TNG has biology this term, so I will be (at erratic intervals, of course) sometimes venturing my thoughts on the teaching of that specific subject.  I think I never quite got around to venting the last time he had a biology unit, back in middle school, which among its sins was still teaching the seven kingdoms system of classification, which for the love of Woese is absurd.

Friday, January 22, 2016

Does any analytical program really care about the order of paired end files?

I was recently experimenting with C. Titus Brown and company's khmer package and hit an interesting little snag.  First, I had my usual problems with installing a Python-based program, which were solved by the totally counter-intuitive absurdity of actually following the installation directions precisely. Armageddon is certainly near if random shortcuts and assumptions can't be relied on to get the job done.  But once I had it working for a simple test case, I crazily tried to build something -- and that's when a new, maddening bug cropped up.

Monday, January 18, 2016

Illumina's MiniSeq Giveaway

This morning, Illumina announced a Scientific Challenge program as part of the launch of the MiniSeq sequencing instrument.  Three prizes will be given away, with 3 sequencing runs on MiniSeq as third prize, a MiniSeq plus reagents for 3 runs as second prize, and a MiniSeq plus reagents for 3 runs plus a Mini Cooper automobile as the grand prize. Entrants in the contest will submit a proposal for how they plan to use the instrument (but not the car; if you support future reagent purchases by being an Uber driver, that's your business). There will also be a set of iPad mini giveaways based on recommending colleagues to enter the contest on social media; if your recommendation results in an entry, then you are entered in the iPad giveaway.

Tuesday, January 12, 2016

Illumina Unveils Firefly

Illumina's third big announcement around JPM is to unveil Project Firefly, a semiconductor sequencer which will use existing SBS library preparation and a derivative of SBS chemistry.  Slotted with a price point ($30K), physical size (small pizza box ish?) and data yield (4M reads, 1Gbp data) below the just announced MiniSeq, Firefly would be two small boxes which could stack: one for library preparation and one to run single channel sequencing.  The flowcell would use ordered arrays, layered atop the semiconductor sensors.  Launch is proposed for the second half of 2017.