I recently remarked that it was surprising that PacBio had not jumped on the German E. coli outbreak strain, given CSO Eric Schadt's professed interest in biosurveillance. Dr. Schadt was kind enough to spend a half hour discussing the topic with me last week, in the wake of PacBio releasing de novo assemblies of this strain and 11 other pathogenic E. coli strains (6 never previously sequenced) on the PacBio DevNet website. Dr. Schadt also has a detailed blog post on the project.
In both the blog post and our conversation, he explained why PacBio wasn't at the forefront of this effort but decided to jump in anyway. Dr. Schadt was certainly aware of the outbreak; he was at a conference in Germany as the public alarm was building. Initially, though, the company was trying to stay focused on the commercial launch of its instrument. After the first public assembly came out at several thousand contigs, however, they consulted some academic collaborators and decided to run their own sequencing, with the goal of providing a much less fragmented assembly. The initial thought was a hybrid assembly combining PacBio and short-read data, but after another group generated a high-quality short-read assembly, the emphasis switched to a PacBio-only assembly.
Sequencing proceeded rapidly, as before: about 2 days from receipt of the samples to generation of data for assembly. Three machines were enlisted, using current-generation SMRTcells but with pre-release polymerase and protocols (scheduled to go to customers by year's end). The statistics are eye-popping: mean mappable read lengths of 2.5-3Kb, 5% of reads at 6-7Kb or greater, and one herculean read of almost 23Kb! Yield from the SMRTcells was variable, ranging from a per-sample mean of 13K reads to 55K (overall mean of 26K). Raw accuracy was still about 85%.
For the German strain, both a large-insert library and a small-insert circular consensus library were sequenced. One challenge the PacBio team ran into was rapidly generating a DNA population of defined size larger than about 9Kb. The population must have a restricted size range, or else the smaller fragments are preferentially loaded into the zero-mode waveguides where the sequencing actually takes place. This diminished the value of the monster reads. Indeed, Schadt sees this as the reason PacBio was unable to drive the data to a single contig (the main chromosome is broken into 33 contigs); there simply weren't reads able to bridge some very large repeat elements. It is also why they didn't use the strobe sequencing mode, which generates islands of sequence separated by statistically defined gaps: given that so many of the reads were approaching the size of the inserts, strobing wouldn't do much good.
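To make the repeat-bridging point concrete, here's a toy sketch. Nothing in it comes from PacBio's actual analysis: the exponential read-length distribution (loosely echoing the 2.5-3Kb mean and long tail above) and the 500bp unique-anchor requirement are my own invented assumptions, just to show why a repeat much longer than the typical read defeats single-contig assembly.

```python
import random

def bridging_fraction(read_lengths, repeat_len, anchor=500):
    """Fraction of reads long enough to span a repeat plus a unique
    anchor on both sides (a hypothetical bridging criterion)."""
    needed = repeat_len + 2 * anchor
    spanning = sum(1 for r in read_lengths if r >= needed)
    return spanning / len(read_lengths)

# Invented read-length distribution: exponential with ~2.75Kb mean,
# capped at the 23Kb maximum read mentioned in the post.
random.seed(1)
reads = [min(int(random.expovariate(1 / 2750)), 23000)
         for _ in range(50000)]

# A 1Kb repeat is bridged by a large share of reads; a 6Kb
# rRNA-operon-scale repeat is bridged by only a few percent.
print(bridging_fraction(reads, 1000))
print(bridging_fraction(reads, 6000))
```

With these made-up numbers, roughly half the reads can bridge a 1Kb repeat but well under a tenth can bridge a 6Kb one, which is the flavor of gap that left the chromosome in 33 contigs.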
Not very far back, I presented a somewhat bearish case for PacBio. Do the latest results change this? As much as I love the technology and the idea of long reads, I'm still concerned that too many scientists will see these as a nice-to-have and not a must-have. Dr. Schadt mentioned they are working on projects to demonstrate PacBio sequencing of much larger (100+Mb) genomes, which is an important start. Still, it may be that many labs will judge the incremental value of long PacBio reads not important enough to justify buying a machine, or will simply rent some of the existing capacity instead (I know of two service providers offering the system). As PacBio pushes the reads longer and longer (imagine if 5% of the reads were 20+Kb!), the advantage for closing long gaps will grow. For example, PacBio should require relatively little DNA for library construction, whereas some of the competing mate-pair techniques are notorious for being very inefficient at converting input DNA into usable fragments (as well as generating some level of noise from chimeras formed in the ligation steps).