Saturday, May 07, 2011

Which numbers did I use this time?

Noted screenwriter William Goldman's second memoir on the film industry is titled "Which Lie Did I Tell?".  The title is not a quote from Goldman, but rather what another movie industry figure said after getting off a long phone call taken in Goldman's presence.  I'm a bit nervous I inadvertently strayed in that direction in my last item on Pacific Biosciences.

I was relying on memory for my numbers, and in the course of writing I think I also revised those downwards, not out of malice but rather in an attempt to be conservative.  As a correspondent pointed out, the first-pass accuracy on the first commercial system is claimed to be 85%, not the 80% I stated.  On the number of reads, I failed to update for the newer SMRT cells which are out; I said 10K and it's probably at least 3X that.  However, I do have a bone to pick.
Getting actual numbers on a lot of these new platforms is very challenging; the PacBio press release is effectively devoid of performance statistics.  An In Sequence article on the first commercial shipments gives a total yield per SMRT cell of 35-45 Mb, but doesn't report the read length or number of reads.  This is a real pain for those of us trying either to masquerade as journalists or to design experiments.  While for genome or exome sequencing (and to some degree transcriptome sequencing) that total sequence generated is the important number, for a lot of other applications (and particularly the ones I'm contemplating) the number of reads is more critical.  Now, of course, if I know the mean read length I can back-calculate the number of reads.  Or, if I know the loading efficiency, I can back-calculate from that.  But PacBio (and Ion as well) loves to tout the number of sensors on the chip (probably because it is a big number and fixed), while that efficiency number is harder to find.

Calculating from read length is tricky simply because there are several estimates floating around.  If I assume 1000 bp reads, then we're talking 35-45 thousand reads per SMRT cell.  If 1500 is the right number to use, that changes to 23-30 thousand.  Both are about 3X higher than what I reported; it was careless of me to fail to account for the new chips.
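The back-calculation is simple enough to show directly; a sketch, using the 35-45 Mb per-cell yields from the In Sequence article and the two assumed mean read lengths (the function name is mine, not anything from PacBio):

```python
def reads_per_cell(yield_mb, mean_read_len_bp):
    """Number of reads implied by a per-cell yield (Mb) and an assumed mean read length (bp)."""
    return yield_mb * 1_000_000 / mean_read_len_bp

# Reported yield range of 35-45 Mb per SMRT cell, under each read-length assumption
for read_len in (1000, 1500):
    lo = reads_per_cell(35, read_len)
    hi = reads_per_cell(45, read_len)
    print(f"{read_len} bp reads: {lo:,.0f} to {hi:,.0f} reads per cell")
```

At 1000 bp this gives 35,000-45,000 reads per cell; at 1500 bp, roughly 23,000-30,000.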

Of the two mistakes, the quality one has a lot less impact.  It's hard to envision applications for such long reads which are really enabled at 85% but not at 80%; indeed, I am convinced the reads could be much worse and still be useful.  But the number of reads was a painful mistake; this really affects the economics of some projects.  For example, it means that about 3 SMRT cells are equal to the original Ion 314 chip in terms of number of reads generated (yes, that is a comparison of sequencers with very different read length and accuracy numbers), though Ion is now claiming in some of its materials 400K reads per 314 chip.

So, unlike Goldman's two memoirs, not funny at all.  I slipped up & need to do better -- and in particular, if I am deriving numbers I should show my work, so that wrong input values are easy for others to spot.  On the other hand, read counts 3X-4X better than what I described still don't change the landscape enough that I would revise what I wrote in the main.  Getting high coverage of a mammalian genome is still going to be pricey; 40Mb is about 1% coverage, so ~4000 SMRT cells would be needed for 40X coverage.  This just won't be economical to do routinely.
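Showing my work this time, the coverage arithmetic above can be written out directly.  The 3 Gb genome size is the usual mammalian approximation; note that exact division gives ~3,000 cells, while rounding the per-cell coverage down to 1% (as in the text) yields the more conservative ~4,000:

```python
# Back-of-envelope: SMRT cells needed for 40X coverage of a mammalian genome.
GENOME_BP = 3_000_000_000        # ~3 Gb mammalian genome (standard approximation)
YIELD_PER_CELL_BP = 40_000_000   # ~40 Mb per SMRT cell, mid-range of reported yields

coverage_per_cell = YIELD_PER_CELL_BP / GENOME_BP   # fold coverage from one cell (~1.3%)
cells_for_40x = 40 / coverage_per_cell              # exact: ~3,000 cells
cells_rounded = 40 / 0.01                           # using the 1% rounding: ~4,000 cells

print(f"{coverage_per_cell:.4f}X per cell")
print(f"{cells_for_40x:,.0f} cells (exact) vs {cells_rounded:,.0f} cells (1% rounding)")
```

Either way, thousands of cells per genome puts routine 40X mammalian coverage well out of reach at launch.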

Anyone have some hard numbers from real datasets?  Or better yet, real datasets they're willing to share?  Those would be far better than trying to finagle the right numbers out of some press releases.


cariaso said...

When you figure it out, please consider adding it to

johnomics said...

Deepak Singh from PacBio came to see the GenePool in Edinburgh last week and gave us some figures. These are from our notes; I believe they are accurate but please correct me if something is wrong.

Each SMRT cell has 75,000 zero mode waveguides. The machine can hold up to 96 SMRT cells for one run. So at full capacity, the machine would produce 7,200,000 sequences in one run.

Each cell takes 45 minutes to run, at 1 base / second, producing reads of around 1.5Kb. (45 * 60 = 2,700, so presumably not all the ZMWs can continue to this length.) It is expected that the enzyme will run at 3 bases / second by the end of the year.

- reagents are currently 'passively loaded' (washed over the cell) rather than actively loaded (introduced to each ZMW), so only 23,000 of the ZMWs on any one cell will produce useful sequence. No roadmap for when active loading will be introduced.

- The cells have to be run in series, not in parallel. So if the machine was running at full capacity, it would take 72 hours to run 96 wells at 45 minutes a well (1 base/second), or 24 hours at 15 minutes a well (3 bases/second).

- But at the moment, the reagents only last long enough for 16 wells to be run. Again, no roadmap for when this will be increased.

This means CURRENT throughput is 16 * 23,000 = 368,000 sequences, which take (16 * 45)/60 = 12 hours to run.

This is either 552Mb of 85% accurate sequence, or, if using circular consensus (3-5 passes of ~500bp fragments), ~184Mb of 99% accurate sequence.

Anonymous said...

Have you had a chance to review this blog post?