Monday, July 11, 2011

PacBio's Foray Against the German E.coli Outbreak

    I recently remarked that it was surprising that PacBio had not jumped on the German E.coli outbreak strain, given CEO Eric Schadt's professed interest in biosurveillance.  Dr. Schadt was kind enough to spend a half hour discussing the topic with me last week, in the wake of PacBio releasing de novo assemblies of this strain and 11 other pathogenic E.coli strains (6 having never previously been sequenced) on the PacBio DevNet website. Dr. Schadt also has a detailed blog post on the project.
    In both the blog post and our conversation yesterday he explained why PacBio wasn't in the forefront of this effort but still decided to jump in. Dr. Schadt was certainly aware of the outbreak; he was in Germany at a conference when the public alarm was building. But initially, the company was trying to stay focused on the commercial launch of their instrument. However, after seeing the first public assembly come out at several thousand contigs, they consulted some academic collaborators and decided to run their own sequencing, with a goal of providing a much less fragmented assembly.  The initial thought was a hybrid assembly containing both PacBio and short read data, but after another group generated a high quality short read assembly the emphasis switched to a PacBio-only assembly.
     Sequencing proceeded rapidly, as before: about 2 days from receipt of the samples to generation of data for assembly.  Three machines were enlisted, using current-generation SMRTcells but with pre-release polymerase and protocols (these are scheduled to go to customers by year's end). The statistics are eye-popping: mean mappable read lengths of 2.5-3Kb, 5% of the reads at 6-7Kb or greater, and one herculean read of almost 23Kb!  Yield from the SMRTcells was variable, ranging from a mean per sample of 13K to 55K reads (overall mean of 26K).  Raw accuracy was still about 85%.
     For the German strain, both a large-insert library and a small-insert circular consensus library were sequenced.  One challenge the PacBio team ran into is rapidly generating a DNA population of defined size bigger than about 9Kb.  The populations must have a restricted size range, or else the smaller fragments are preferentially loaded into the zero-mode waveguides where the sequencing actually takes place.  This diminished the value of the monster reads.  Indeed, Schadt sees this as the reason PacBio couldn't drive their data to a single contig (the main chromosome is broken into 33 contigs); there simply weren't reads able to bridge some very large repeat elements.  This is also why they didn't use the strobe sequencing mode, which generates islands of sequence separated by statistically-defined gaps: given that so many of the reads were approaching the size of the inserts, strobing wouldn't do much good.
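The repeat-bridging problem can be illustrated with a quick back-of-envelope simulation. Everything in this sketch is an illustrative assumption of mine, not PacBio's data: an exponential read-length model with a 2.8Kb mean (loosely matching the reported mappable lengths) and a 500bp unique-anchor requirement on each flank. The point it makes is simply that the fraction of reads able to bridge a repeat falls off sharply with repeat length, which is why a handful of very large repeats can keep an assembly from closing.

```python
# Back-of-envelope sketch of the repeat-bridging problem described above.
# The exponential length model, the mean, and the anchor size are my own
# illustrative assumptions, not PacBio's actual read-length distribution.
import random

random.seed(1)

MEAN_LEN = 2800  # roughly the reported 2.5-3Kb mean mappable read length
reads = [random.expovariate(1.0 / MEAN_LEN) for _ in range(100_000)]

def bridging_fraction(read_lengths, repeat_len, anchor=500):
    """Fraction of reads long enough to span a repeat of `repeat_len`
    bases with `anchor` bases of unique flanking sequence on each side."""
    needed = repeat_len + 2 * anchor
    return sum(length >= needed for length in read_lengths) / len(read_lengths)

for repeat_len in (1_000, 5_000, 10_000):
    frac = bridging_fraction(reads, repeat_len)
    print(f"{repeat_len} bp repeat: {frac:.4f} of reads bridge it")
```

Under these assumptions only a percent or two of reads can span a 10Kb repeat, so a repeat present in several copies quickly exhausts the supply of bridging reads.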
    Not long ago, I presented a somewhat bearish case for PacBio.  Do the latest results change this? As much as I love the technology and the idea of long reads, I'm still concerned that too many scientists will see these as a nice-to-have and not a must-have.  Dr. Schadt mentioned they are working on projects to demonstrate PacBio sequencing of much larger (100+Mb) genomes, which is an important start.  Still, it may be that many labs will judge the incremental value of long PacBio reads not important enough, or that they simply don't need to invest in a machine and can instead rent some of the existing capacity (I know of two service providers offering the system).  As PacBio pushes the reads longer and longer (imagine if 5% of the reads were 20+Kb!), it will offer advantages for closing long gaps.  For example, PacBio should require relatively little DNA for library construction, whereas some of the competing mate-pair techniques are notorious for being very inefficient at converting input DNA to usable fragments (as well as creating some level of noise from chimeras generated in ligation steps).

11 comments:

Anonymous said...

How would you rate the utility of PacBio vs. a mapping technology such as OpGen in 'closing the gaps'? It seems a combination of short-read sequencing with OpGen-esque mapping might be just as valuable as PacBio (if only they can bring the mapping price down!)

Jonathan Badger said...

Technically, Eric is the CSO, meaning he's the head science guy at PacBio, rather than the CEO.

Keith Robison said...

Ugh, first I botch the name, then the title. What next?

flxlex said...

You wrote "PacBio should require relatively little DNA for library construction".
This remains to be seen, as they describe using the HydroShear for generating long fragments (see the Assemblathon parrot PacBio data). The HydroShear requires quite a lot of DNA, at least for the 454 mate-pair libraries. As PacBio library prep also involves adapter ligation, but fortunately no circularization, I think the input requirements will still be pretty high...

Keith Robison said...

One of the PacBio service providers requires 5ug of input DNA for standard sequencing, which is not oppressive though certainly not just a sip.

My impression of mate pair protocols is that many require 20-50ug of input DNA, because the post-circularization steps are inefficient in most protocols. For example, many rely on mechanical shearing, so only a minority of molecules will have broken in the two regions necessary but not in the "vector" backbone.

Somewhere I thought I saw the actual requirement as 500ng (service providers often ask for several fold what the protocol demands), but I don't seem to have that handy.

Anonymous said...

Keith, users have repeatedly told me it requires massive amounts of DNA to work, i.e. not so great for, say, a needle biopsy

Keith Robison said...

Needle biopsies are notorious for yielding very little DNA; it's a bit silly to call anything more "massive".

Correct me if I'm wrong, but kits (either from the manufacturer or 3rd parties) for other platforms ask for 100ng or more (Nextera might be 50ng?). That's still a lot more than some biopsies (or laser capture) might yield, but easily satisfied from blood, culture or most xenografts (though not very early ones).

kris said...

Where does this leave the Roche GS FLX? They seem to be stalled on further improvements to their technology. No drastic price reductions are in sight either.

Anonymous said...

Mind-boggling to see 454 frozen in time... not even banking on their old-school reputation...

kris said...

Just scanned through the PacBio publication in NEJM. Now that the humble E. coli strain from Germany has been sequenced and assembled by different labs using almost all NGS platforms (Ion, MiSeq, PacBio, 454 FLX, GS Junior, HiSeq, etc.) and various bioinformatic tools, is there a meta-analysis (a whole-genome comparison of independent de novo assemblies of raw sequence data from each single platform) that I missed? Hopefully through such an impartial study we could learn more about the strengths and weaknesses of the NGS platforms, as well as lab-to-lab variation. Would the core genome size increase substantially from the 2.6 Mb reported in the NEJM study when comparing 53 old and new whole-genome sequences of different E. coli strains? Theoretically it should, but by how much is the question!

Anonymous said...

What I am confused about and that hasn’t really been asked of PacBio is did PacBio have to collect all the data they did. If you go to the genbank single read archive, PacBio used 32 ZMWs for closed circular reads and 24 ZMWs for the long subreads and this was for only the single E. coli outbreak isolate. This is supported by their supplementary publication information in their recent publication (Supplementary Table 1). That is a total of 56 ZMWs they used for one isolate, which were collected on 3 instruments running simultaneously. They did state at one point that all 56 ZMWs were used to build their assembly of the outbreak isolate. If they really needed all 56 ZMWs for the single isolate, that would be over $5,500 at list price and would have taken over 3 days in just data collection if they were to use a single instrument (more realistic since I don’t think any customers have more than 1 yet since 3 instruments would be a list cost of over $2,000,000). Yet, they were only able to get the contig number down to 37 contigs (33 for the chromosome and 4 for plasmids)? HPA’s 454 run(s) resulted in 13 contigs. OpGen’s run resulted in 1. I guess I just do not understand what PacBio’s long subreads or higher accuracy closed circular reads are providing if they can’t even decrease the contig number fewer than a 454. Is this the PacBio distortion field or am I totally missing something (I very well could be)? It seems to be more cost effective and faster to use a lower priced sequencer (e.g. 454 or illumina) and then pair it with an OpGen single scaffold to assemble the genome like the previous poster suggested. OpGen’s maps give the whole genome view with important areas identified that inform and guide sequencing.