Saturday, May 14, 2011

Oh Would It Be Fun to Debug Strobe Sequencing!

Through the course of a month, I easily sketch out (on average) a dozen plus experimental designs in the context of my day job.  Of course, only a very few of these are ever executed; many never get shown to another soul.  By constantly pondering how I might tackle questions, I can keep in practice and present plans which are reasonably well thought-out.  It also helps to have thought through things; sometimes a rejected plan will suddenly look like a gem due to some other result or change in priorities.
Atop that, I also sketch up a few which have nothing to do with my day job.  Partly it is fun, and partly it is a way to exercise the process on areas where I can be certain of my objectivity.  A related exercise is to sketch out a business plan; I end up doing that a few times a year.  I have never taken one any further than a napkin; not only do I lack the necessary thirst for risk, but most are for businesses where I'd be happier being a customer than an employee.

For example, after a recent lunch with a friend from graduate school I found myself again contemplating a question that had arisen at Codon in the context of gene synthesis.  We didn't have second-generation sequencing there (if Codon had survived, we certainly would have launched into it at some point) and had seen the hint of an interesting phenomenon (and practical problem) but could never have collected enough data to really nail it down.  Now, with sequencing cheap on a large scale, and this being the perfect sort of problem for such sequencing (since one would need only very short reads), it was a short-term obsession to work out how to run such an experiment.  For probably $5K and a tiny bit of molecular biology, a nice little paper -- pity I don't have a slush fund to cover it.   My apologies for not supplying any details; maybe I will someday have some found money to cover the project.

However, I did just get another idea that I might as well be open about: for one, it will probably be solved in the near future, and for two, I have no opportunity to actually work on it.  So it can be fun to speculate in the open on how to address the problem.
Someone was kind enough to show me the recent In Sequence article describing user experiences with the PacBio sequencer, as reported at the Cold Spring Harbor Laboratory Biology of Genomes Meeting (a subscription to that newsletter would be another fine use of my non-existent slush fund). The article has some useful information, though it suffers from my usual complaint of reporting the number of sensors (ZMWs) in the system and not the number actually acquiring data.  However, there is an interesting tidbit here that impinges on this.  Apparently each cell is scanned twice; the system can only image half of a cell (75K ZMWs) at a time.  The first "movie" in one case yielded 35Mb of data with a mean read length of 1400, or about 25K reads, but the second movie yielded 25Mb of data with a mean read length of 1500, or about 16K reads.  Since the polymerases are working on the second set while the first set is being watched, this would suggest a selection for a smaller number of higher performing polymerases (or templates).
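
For anyone who wants to check my arithmetic, here is a quick sketch of the read-count estimates. It assumes "Mb" means 10^6 bases and that read count is simply total yield divided by mean read length; the function name is my own invention for illustration.

```python
def approx_read_count(total_bases, mean_read_length):
    """Estimate the number of reads as total yield / mean read length."""
    return total_bases / mean_read_length

# Figures quoted above for the two movies of one cell
# (assuming 1 Mb = 1e6 bases).
first_movie = approx_read_count(35e6, 1400)
second_movie = approx_read_count(25e6, 1500)

print(f"first movie:  ~{first_movie:,.0f} reads")
print(f"second movie: ~{second_movie:,.0f} reads")
print(f"second/first: {second_movie / first_movie:.2f}")
```

The ratio of the two estimates is the "about 2/3" figure I come back to further down.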

The early reports suggest that for genome assembly the PacBio will provide truly valuable scaffolding information.  In one bacterial example,  22 contigs from a pure Illumina assembly went to a single sequence with the PacBio data thrown in.   It was also useful to learn that PacBio does have difficulty with high-GC and high-AT regions, though otherwise the errors look random (and hence can be battled by high oversampling).
What really caught my attention are some details on the strobe sequencing.  This is particularly interesting since it is a mode unique to single molecule systems and the PacBio's long read lengths should make it very valuable (Helicos called a similar scheme "dark fill").  In strobe sequencing, the illuminating laser is periodically switched off; in theory this lets the polymerases fill a region of sequence without being exposed to damage and then acquire additional data farther away.  Because the polymerases work at predictable speeds, the distance covered in the dark can be estimated.

The real beauty of this approach is that a single library prep can give multiple distance estimates, simply by changing how the  instrument is operated.  One run could do no strobing, a second with a dark period of 1Kb, another with a dark period of 10Kb and so on.  Ideally, one could flip the switch multiple times and obtain a series of sequence islands.  This is in contrast to the approaches in most systems, in which each library can give only a single class of distance information.  If you want three classes of mate pairs in SOLiD, 454 or Illumina, you need to make three different libraries.
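
To make the "predictable speed" idea concrete, here is a minimal sketch of the conversion between dark time and dark distance. The polymerase rate below is a placeholder I made up for illustration, not a measured or published figure, and the function names are hypothetical.

```python
def dark_distance(rate_bases_per_s, dark_seconds):
    """Expected bases synthesized while the laser is off."""
    return rate_bases_per_s * dark_seconds

def dark_time_for(target_bases, rate_bases_per_s):
    """Seconds of darkness needed to skip roughly target_bases."""
    return target_bases / rate_bases_per_s

# ASSUMPTION: placeholder rate, chosen only to show the arithmetic.
ASSUMED_RATE = 2.0  # bases per second

for target in (1_000, 10_000):  # the 1Kb and 10Kb dark periods above
    seconds = dark_time_for(target, ASSUMED_RATE)
    print(f"{target:>6} bases of dark fill -> {seconds / 60:.1f} min of darkness")
```

In practice the rate would vary from polymerase to polymerase, so the real distance would be an estimate with some spread rather than a single number.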

Unfortunately, the reality for PacBio strobe sequencing is initially not what one would hope.   The machines were programmed for two dark periods, which should result in three sequence islands.  However, about 2/3 of the ZMWs quit after the first dark period; 26% gave two islands and only 7% gave all three.  This makes the strobe sequencing much less efficient.

The question now is why?  What is causing so much attrition?  7% is close to 26% of 26%, so it appears to be an exponential decay process.  One suggestion is that some of the polymerases are running off the ends of the DNA molecules; in theory the PacBio is sequencing closed circular DNAs but perhaps some have the necessary closure only at one end.  It's also curious given the numbers for the 1st and 2nd views of a SMRTcell I repeated above; there 2/3 of the ZMWs gave useful sequence after an extended period of dark synthesis.  It isn't clear if the two results were on the same libraries, so I may be comparing apples with oranges.
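
Here is the back-of-envelope check in runnable form. The first calculation is the comparison I made above; the second is an alternative way of slicing the same numbers, and it rests on my assumption that the 26% and 7% are non-overlapping groups (exactly two islands versus all three).

```python
two_islands = 0.26    # fraction of ZMWs giving exactly two islands
three_islands = 0.07  # fraction of ZMWs giving all three islands

# The comparison made in the text: is 7% close to 26% of 26%?
squared = two_islands ** 2

# ASSUMPTION: an alternative reading, treating each dark period as a
# survival step and asking what fraction made it through each one.
survived_first_dark = two_islands + three_islands
survived_second_dark = three_islands / survived_first_dark

print(f"26% of 26%:                     {squared:.3f}")
print(f"survived the first dark period:  {survived_first_dark:.2f}")
print(f"survived the second dark period: {survived_second_dark:.2f}")
```

Either way, two dark periods give only two points on the curve, which is not much to hang a decay model on.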

If I were trying to tackle this, the first thing I would do is to go boring.  By this I mean preparing libraries where every input molecule is the same and a known quantity; this is deathly dull and means generating nothing new in terms of sequence data, but is the most valuable for understanding a process.  I'd pick two or three dull molecules, chosen for different size ranges.  Perhaps a piece of lambda phage or some clone insert that can easily be popped out.  These fragments would also be carefully size-selected, to ensure a very high quality of dullness.

Then I'd run the strobe sequencing with a large number of short strobes, so as to get a better picture of the decay curve.  Is it really a straight exponential decay or something more complicated?  Does the length of time in the dark (or the length of the illumination periods) alter the curve?  Are there any additives which alter the behavior?  Some of the dark periods would be quite short, but other runs would be designed to have longer dark periods.
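
One simple way to ask "is it really a straight exponential?" is to fit a line to the log of the surviving fraction against the strobe number and look at the residuals. Below is a minimal sketch of that idea in plain Python. The input is synthetic, generated inside the example from a decay rate I picked arbitrarily; it is not measured data, and the function names are hypothetical.

```python
import math

def fit_exponential_decay(fractions):
    """Least-squares fit of ln(fraction) = intercept - rate * k,
    where k = 0, 1, 2, ... is the strobe index.
    Returns (rate, intercept)."""
    xs = list(range(len(fractions)))
    ys = [math.log(f) for f in fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept

def log_residuals(fractions, rate, intercept):
    """Difference between observed and fitted ln(fraction) per strobe."""
    return [math.log(f) - (intercept - rate * k)
            for k, f in enumerate(fractions)]

# SYNTHETIC input: a perfect exponential with an arbitrary rate.
TRUE_RATE = 0.3
synthetic = [math.exp(-TRUE_RATE * k) for k in range(8)]

rate, intercept = fit_exponential_decay(synthetic)
worst = max(abs(r) for r in log_residuals(synthetic, rate, intercept))
print(f"fitted rate: {rate:.3f}, largest residual: {worst:.2e}")
```

With real data, residuals that trend systematically (rather than scattering around zero) would be the hint that something more complicated than a single exponential is going on.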

How would this be helpful?  First, getting a finer picture of the decay curve.  Second, if the library is truly of a uniform length going in, then the drop-off can be modeled more precisely.  For example, suppose most of the drop-off is due to incompletely circularized input molecules.  If this is the case, then I should see virtually no drop until I get to the length of my input material, but then a very sudden drop once I get there.  With careful timing, it should be possible to have that putative end fall during a light period for most ZMWs.
On the other hand, suppose some process is blocking polymerase at random (or semi-random) spots along the molecules; some sort of random blockage or cleavage of the DNA.  In this case, I'd see a decay curve which was not synchronized with my input ends.
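
The two hypotheses predict differently shaped curves, which a toy model makes easy to see. This is only a sketch under idealized assumptions: a perfectly uniform insert length, run-off happening exactly at the insert end, and random blockage occurring at a constant rate per base. All parameter values are invented for illustration.

```python
import math

def survival_end_runoff(position, insert_length, open_fraction):
    """Fraction of polymerases still going at `position` if the only
    loss is running off molecules that are open at the insert's end."""
    return 1.0 if position < insert_length else 1.0 - open_fraction

def survival_random_block(position, mean_block_distance):
    """Fraction still going if blockage strikes at a constant rate per
    base (an exponential survival curve)."""
    return math.exp(-position / mean_block_distance)

# HYPOTHETICAL parameters, chosen only to show the contrast.
INSERT_LENGTH = 2000        # bases
OPEN_FRACTION = 0.5         # fraction of molecules not fully circular
MEAN_BLOCK_DISTANCE = 3000  # bases

print("position  run-off  random-block")
for pos in range(0, 4001, 500):
    runoff = survival_end_runoff(pos, INSERT_LENGTH, OPEN_FRACTION)
    blocked = survival_random_block(pos, MEAN_BLOCK_DISTANCE)
    print(f"{pos:>8}  {runoff:>7.2f}  {blocked:>12.2f}")
```

The run-off column stays flat and then steps down at the insert length, while the random-blockage column slides down smoothly with no relation to the insert ends. A real library could of course show a mixture of both.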

Another great advantage of preparing a very large quantity of a few boring libraries is that these experiments could be run very quickly.  Analysis of the data could be used to generate hypotheses which could in turn be tested almost immediately.  With the PacBio, one could even imagine systematic tests of various parameters in a Design of Experiments framework.  

Of course, once great progress was made on the boring libraries the generality of the solutions would need to be tested on a greater diversity of samples.  One also cannot completely discount the idea that strobe sequencing will always incur a cost versus non-strobe.  Right now, that cost is steep, but perhaps with some optimization it can be made far less taxing.


Anonymous said...

Looks like some folks are actually starting to think about this:

Keith Robison said...

My impression is that OpGen's box sells for an awful lot also, and it does just one thing.

There is a real danger that sequencing-based methods will fill the niche that OpGen is trying to occupy, and those will provide much more information (such as haplotypes).

For example, an acquaintance mentioned getting good mate pair libraries from original inserts around 30Kb. HAPPy mapping by sequencing has apparently been demonstrated on an invertebrate genome. Illumina sequencing of liquid culture fosmids was successfully used to haplotype a human (published by Shendure's lab). Several groups have published on generating Illumina libraries from isolated chromosomes.

If OpGen is going to succeed in the short term, they need to really grab attention & become the recognized standard method. Otherwise, the sorts of things I mentioned above, PacBio pushing their read lengths out & the spectre of ultra-long read sequencing technologies (which would not require much accuracy to go after OpGen's space) will nibble away at OpGen's market.

Anonymous said...

Hi Keith, excellent post I found after looking for data on the strobe success rates on a PacBio. I thought the 2/3 ZMW quitting after the first dark period was startling. However, I did notice your original post was back in May. Have you heard any new updated data since it is now August?

In terms of the OpGen technology that you wrote about in the comments, it would seem to me that a single molecule mapping read that OpGen collects (they’ve stated >200 Kb and no theoretical upper size limit) is far and away larger than any strobe read, any sequence read, or any insert library currently used. So, OpGen still brings data to the table and they are more complementary to sequencing than competitive at this point and for the foreseeable future. Just like many people are using multiple sequencing read data sets (e.g. Illumina and 454) to complete genomes, using an OpGen technology hybrid assembly approach with nextgen data has already been shown to be fruitful (Sequence Finishing and Analysis in the Future 2011 conference). It doesn’t seem as though any one set of data will rule them all.