Monday, September 23, 2013

Potential Sources of Drag on PacBio's Long Read Performance Trajectory

Over at there are two detailed blog entries on Pacific Biosciences entitled "End of Short Read Era?" (Part I  and Part II).  I've tweeted a number of comments on the technical aspects, but there are some more substantial thoughts reading these pieces helped me condense.
First, a small confession.  I sometimes worry I'll be labeled a PacBio partisan, because I've been a strong proponent of using their technology for microbial sequencing.  Their steady progress, doubling performance at regular intervals, along with key informatics improvements, has made this a powerful platform for microbial genome sequencing and has really raised the bar on what one should expect from the typical microbial genome project.  If you are sequencing a few bugs and can generate high quality, long DNA, then I think it is clearly the best way to go, and at prices that are quite reasonable. But I think it is important not to dismiss short reads in this area.

One key issue that has been raised by many, and particularly recently by Mick Watson in his blog, is that the cheapest short read platforms deliver bacterial genomes at a far lower cost than PacBio.  He estimated the difference at 10X, using publicly available pricing at core facilities. Now, the important caveat here is whether the difference between the finished products matters.  For most microbial genomes, as shown by the recent Koren et al paper, current PacBio reads can close every replicon (chromosome or plasmid) into a single contig.  In contrast, many repeats in microbial genomes are longer than what can be resolved with standard short read sequencing, and even with mate pairs it is my experience that there are regions which are difficult to resolve.  So, how much do those areas matter?  For some types of study they are critical, but for others they are more of a nice-to-have.  Or, more importantly, for some studies getting a pretty good sequence (lots of contigs) on many isolates is more important than near-finished sequence (contiguous, but with some indels remaining) on a smaller number of isolates.  The "a few good references and many resequencings" model has a lot of life left in it.

The nature of PacBio's remarkable march is also going to limit the price/performance, at least until some other problems are tackled.  With the exception of the doubled imaging area in the RS I to RS II upgrade launched this spring, essentially all of the performance improvement in PacBio's platform has come from pushing the read lengths to distances which still boggle the mind.  Thirty kilobase reads???!?!?!?  Clever informatics, such as the HGAP algorithm and BLASR aligner, can leverage that length to overcome the high (~15%) error rate.  Those rare mega-reads illustrate what the platform is capable of; much of the further development is in increasing their frequency.  

But pushing in this direction has its challenges.  Back-of-the-envelope, PacBio is currently about 7-8 doublings away from the mythical $1K human genome. If the reads now are averaging 10Kb, and at least a 2X improvement is gained by loading the flowcell at better than Poisson (which has been discussed openly but, as far as I know, has no announced delivery timeline), then getting the rest from read length alone would require reads approaching or exceeding a megabase. This is certainly not the plan (presumably another optics upgrade will come at some point), but realistically there are huge challenges even getting into the 50-100 kilobase range.
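
As a rough check on that arithmetic, here is a minimal Python sketch. It assumes that throughput scales linearly with read length and that every doubling not supplied by better-than-Poisson loading has to come from longer reads. The function name and its parameters are illustrative only.

```python
def required_read_length_kb(current_read_kb, doublings_needed, doublings_from_loading):
    """Read length needed if all remaining doublings come from longer reads.

    Assumption (not from PacBio): output per run scales linearly with
    mean read length, so each remaining doubling doubles the read length.
    """
    remaining = doublings_needed - doublings_from_loading
    return current_read_kb * 2 ** remaining


# 10 Kb reads today, 7-8 doublings to go, one doubling (2X) from better loading
for doublings in (7, 8):
    kb = required_read_length_kb(10, doublings, 1)
    print(f"{doublings} doublings -> {kb} Kb ({kb / 1000:.2f} Mb)")
```

With 7 doublings this gives 640 Kb and with 8 it gives 1280 Kb, so the megabase figure sits at the upper end of the estimate.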

A key truism is that it is impossible to read novel sequence longer than the insert size.  As PacBio pushes into the >20Kb range, they will be outside their current technology for shearing DNA (Covaris' clever G-tube devices).  Shearing above that will probably mean going to devices such as HydroShear, which have been used extensively for cosmid library construction.  But nobody made cosmid libraries on the scale of a serious sequencing operation, and that particular device is notorious for clogging.  Going beyond about 50Kb means going above the size at which DNA starts shearing from simply being a long molecule.  Again, techniques were developed to generate and handle DNA in this size regime (to make YAC and BAC libraries), but it is another leap upwards in difficulty and labor.  No more simple spin column kits! Finally, for some key samples, such as most clinical cancer samples (and archival pathology samples of all types), going long is useless because the formalin fixation / paraffin embedding process shears DNA.

For small genomes, these future performance improvements may hit a wall: the flowcell and library costs.  Barcoding on the RS platform is still bleeding edge, and my one experience with it did not go well.  The challenge is that if the polymerase "jumps the gun" and begins polymerizing before imaging begins, then the barcode in the initial adapter will be missed.  PacBio has worked on a "hotstart" technology to minimize this, but in our experiment we had many reads lacking barcodes.  One solution is to read the barcode also at the other end, but that requires ensuring that reads get that far, which means giving up some effective output as some reads "turn the corner" and read back into the same insert.  If output is huge, giving up a bit there might be tolerable, but otherwise for most projects PacBio runs the risk of having a floor price of 1 library construction plus 1 flowcell (if you have samples you are certain you can tell apart, one could just throw them together and decode each contig at the end).  Any genome you want for $500 sounds great, but if you have a lot of small genomes to sequence you might want to do better.

Library construction cost is another potential speed bump.  Mick Watson was careful to use published prices that can actually be had, but a number of papers have claimed generating short read libraries for $1-$5 rather than the few hundred dollars that a core will charge.  Now, this requires making lots of libraries, just as getting really dirt cheap read costs per sample requires piling many samples into a short read flowcell.  But, if these library costs can really be reached, then getting a draft genome on Illumina could in theory be had for <$10/sample (and perhaps closer to $5), as long as you are sequencing hundreds or thousands of isolates.
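
To make the per-sample arithmetic concrete, here is a small Python sketch. It assumes the simplest possible model, in which each sample pays for its own library plus an equal share of one run; the function names are hypothetical, and no actual run price is taken from any vendor or core. It simply asks how much a run could cost in total while still meeting a per-sample target.

```python
def per_sample_cost(library_cost, run_cost, samples_per_run):
    """Cost per sample: own library plus an equal share of the run."""
    return library_cost + run_cost / samples_per_run


def max_run_cost(target_per_sample, library_cost, samples_per_run):
    """Largest total run cost that still meets the per-sample target."""
    return (target_per_sample - library_cost) * samples_per_run


# $5 libraries and a $10/sample target, at increasing levels of multiplexing
for n in (100, 1000):
    budget = max_run_cost(10, 5, n)
    print(f"{n} samples/run: run may cost up to ${budget}")
```

Under this model, the cheaper the library gets, the more the total is dominated by how many samples can share a run, which is why the low figures only apply when sequencing hundreds or thousands of isolates.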

The silliest (in my opinion) perceived issue with PacBio, at least in the short term, is the cost of the instrument.  Yup, it's a beast on the budget.  But as I have emphasized before, there are many good core labs and commercial providers which offer it as a service, and based on my experience the capacity of that installed base is not fully utilized.  

I've written all of the above assuming no major shifts in the sequencing technology landscape.  Given the amount of investment in new technologies, such as electron microscopy and nanopores, it will be surprising if no new technology emerges over the next few years.  I really hope to witness another technology eclose, but the when of that and what performance will follow remain a mystery.


Anonymous said...

Wouldn't the circular consensus feature mean that insert fragment length doesn't need to keep pace with read length to still be useful throughput increases? Of course the longer the fragment the easier the assembly, but you could still use all those additional reads with shorter fragments. More accuracy for the same coverage too; and would improve barcoding.

Anonymous said...

Any thought on competing technologies (such as Ion Torrent)?

BTW, at $5/sample for Illumina, bioinformatics will be the rate limiting factor.

Keith Robison said...

I need to think this one through further, but to some degree I agree that this will be a route to getting accurate barcodes and thereby packing lots of small samples on a SMRT cell. It will mean giving up significant output -- I think. That's the part I really need to work out.

Anonymous said...

Do you mind posting some references to the purported $1/sample prep? Thanks!

I was able to find this one

And just wondering if you have any more.

Any thoughts on Moleculo? The Long read service is out but requires >100Mb genomes, which really limits its application for microbial work.

Keith Robison said...

Responding to some of the comments
1: Ion Torrent may have a niche where either per run cost or speed is valuable, but their data quality remains a concern. Worth watching, but their performance improvements are no longer coming out regularly.

2: The low sample prep reference is turning into an embarrassment; I was certain I could retrieve them, but so far not much luck! I will keep trying. One that I know isn't published is one of the companies with a sub-microliter-capable liquid handler showing that Nextera can be scaled down about 10X, which I think gets into the $5/sample range.

3: Moleculo is a very interesting technology. However, it relies on long-range PCR, Nextera, Illumina and an assembly algorithm, and so really serious repeats (either complex simple ones or long exact ones) will remain a challenge for it, as will regions of strong nucleotide composition bias.

Anonymous said...

Here's my naive thinking on the circular consensus. Say we can only shear at 20kb effectively but we size select well and get a mythical perfect library of 20kb fragments. We load one 150k ZMW cell and about 60k ZMWs load with one 20kb fragment each. So that's 1.2 Gb of nucleotides to potentially read.

If the read length is 20kb, then we read the whole fragment length and yield the entire 1.2 Gb, at 1x coverage. If the read length is 40kb, then we read each fragment twice in a circle. So we've read 2.4 Gb, but really that's the original 1.2 Gb with 2x coverage. Though we get better error correction than plain 2x coverage would normally yield, so it's possibly worth more than 2x coverage. So if our ultimate goal was to get 200x coverage from the experiment, the number of cells we need to use has been cut in half or more even though our fragment size hasn't changed. Overly simplified train of thought though?
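
A minimal Python sketch of the arithmetic above, under the same idealized assumptions: a perfectly size-selected library, every loaded ZMW reading to the full read length, and each pass around the circle counting as one unit of coverage. Adapter sequence and real length distributions are ignored, and the function names are made up for illustration.

```python
def ccs_yield(insert_kb, read_kb, loaded_zmws):
    """Return (unique Gb, passes per insert, raw Gb read) for one cell.

    Only meaningful when read_kb >= insert_kb, i.e. every insert is
    read through at least once.
    """
    unique_gb = loaded_zmws * insert_kb / 1e6  # 1 Gb = 1e6 kb
    passes = read_kb / insert_kb               # times each insert is read
    raw_gb = unique_gb * passes
    return unique_gb, passes, raw_gb


def relative_cells(passes_baseline, passes_new):
    """Cells needed for the same coverage, relative to the baseline."""
    return passes_baseline / passes_new


for read_kb in (20, 40):
    unique, passes, raw = ccs_yield(20, read_kb, 60_000)
    print(f"read {read_kb}kb: {unique:.1f} Gb unique, {passes:.0f}x passes, {raw:.1f} Gb raw")

print("cells needed at 40kb vs 20kb:", relative_cells(1.0, 2.0))
```

This reproduces the 1.2 Gb and 2.4 Gb figures and the halving of cells; it does not model the extra accuracy gained from consensus.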

Keith Robison said...

That's the line of thinking I'm working on, but the devil is getting that size selection -- there will be further losses with real distributions.

I need to sketch this up with some reasonable numbers.