Wednesday, June 23, 2010

PacBio oiBcaP PacBio oiBcaP

PacBio has a paper in Nucleic Acids Research giving a few more details on sample prep and some limited sequencing quality data on their platform.

The gist (which was largely in the previous Science paper) is that the templates are prepared by ligating a hairpin structure onto each end. In this paper they focused on a PCR product, with the primers designed with type IIS (BsaI) restriction sites flanking the insert. Digestion yields sticky ends (and with BsaI, these can be designed to be non-symmetric to discourage concatemerization) which enable ligating the adapters. Of course, for applying this on a large scale you do run into the problem of avoiding restriction sites within the inserts. There are other approaches, such as uracil-laden primer segments which can be specifically destroyed by the DUT and UNG enzymes -- the only catch being that many of the most accurate polymerases can be poisoned by uracil. A couple of other schemes have either been proven or can be sketched quickly.
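The restriction-site problem is easy to screen for up front. Here is a minimal sketch, assuming you have candidate insert sequences as plain strings: it looks for the BsaI recognition sequence (GGTCTC) on either strand. The function names and the example inserts are made up for illustration and have nothing to do with the paper's actual amplicon.

```python
# BsaI recognizes GGTCTC; finding the reverse complement (GAGACC) in the
# top strand means the site sits on the bottom strand.
BSAI_SITE = "GGTCTC"


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_sites(insert, site=BSAI_SITE):
    """Return sorted 0-based start positions of the site on either strand."""
    insert = insert.upper()
    hits = []
    for motif in {site, reverse_complement(site)}:
        start = insert.find(motif)
        while start != -1:
            hits.append(start)
            start = insert.find(motif, start + 1)
    return sorted(hits)


# Hypothetical inserts, invented purely for this example.
candidate_inserts = {
    "clean_amplicon": "ACGTACGTTTGACCAGTACGATCG",
    "forward_site": "AAGGTCTCAATTCCGGA",
    "reverse_site": "TTAAGAGACCAATTCCGG",
}

for name, sequence in candidate_inserts.items():
    positions = find_sites(sequence)
    verdict = "OK" if not positions else f"internal BsaI site at {positions}"
    print(f"{name}: {verdict}")
```

Any insert that gets flagged would be cut internally during template prep, so it would need a different enzyme or one of the alternative schemes.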

In any case, once these molecules are formed they have the nice property of being topologically circular and having the insert sequence represented twice (once in reverse complement) within that circle, with the two pieces separated by the known linker sequence. So if the polymerase lasts long enough to go round and round, then each pass gives another crack at the sequence. For short templates, this gives lots of rounds, and even on a longer (1 kb) PCR product they successfully had three bites at the apple. Feed these into a probabilistic aligner and each pass improves the quality values. Running the same template many times enabled calibrating the quality values (which are Phred-scaled), showing them to be a reasonable estimator of the likelihood of miscalling the consensus.
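To see why each extra pass helps, here is a toy sketch of combining calls at a single position. It is not PacBio's aligner. It assumes the passes are already aligned and flipped to the same strand, ignores insertions and deletions entirely, treats passes as independent, and spreads each pass's error evenly over the three other bases. The function names and the cap on the reported quality are my own inventions.

```python
import math

BASES = "ACGT"


def phred_to_error(quality):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-quality / 10)


def consensus_call(calls, max_quality=60.0):
    """Combine (base, phred) calls from several passes at one position.

    Returns (consensus_base, consensus_phred) under a uniform prior over
    the four bases. max_quality is an arbitrary cap (an assumption).
    """
    log_like = {}
    for candidate in BASES:
        total = 0.0
        for base, quality in calls:
            # Clamp at 0.75, where a call carries no information at all.
            p_err = min(phred_to_error(quality), 0.75)
            if base == candidate:
                total += math.log(1.0 - p_err)
            else:
                total += math.log(p_err / 3.0)
        log_like[candidate] = total

    best = max(log_like, key=log_like.get)
    # Rescale by the best log-likelihood so the exponentials stay in range.
    weights = {b: math.exp(ll - log_like[best]) for b, ll in log_like.items()}
    p_wrong = sum(w for b, w in weights.items() if b != best) / sum(weights.values())
    if p_wrong <= 0.0:
        return best, max_quality
    return best, min(max_quality, -10.0 * math.log10(p_wrong))


# The same base seen at a modest Q10 on one, two and three passes.
for n_passes in (1, 2, 3):
    base, quality = consensus_call([("A", 10)] * n_passes)
    print(n_passes, base, round(quality, 1))
```

In this toy model a single pass simply hands back its own quality, and every agreeing pass after that pushes the consensus quality up, which is the qualitative behavior described above.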

Given that their polymerase eventually dies on a template, they point out that the length of a fragment is highly predictive of the number of passes that can be observed, which in turn determines the sequencing quality that will be obtained for that template. This makes for an interesting variation on the usual "what size library" question in sequencing. For sequencing a bacterial genome, it may make sense to go very deep with short fragments to get high-quality pieces, which can then be scaffolded using much longer individual reads (without the circular consensus effect) as well as the spaced "strobe" reads to provide very long-range scaffolding. Given the claims made of capacity per SMRT cell, it would seem that by making 2 libraries (one short for consensus sequencing, one long for the others) and burning perhaps 4-5 $100 SMRT cells, one could get a very good bacterial genome draft. For targeted sequencing by PCR, the approach is quite attractive. Amplicon length becomes an important variable for final quality. Appropriate matching of the upstream multi-PCR technology (such as RainDance or Fluidigm or just lots of conventional PCRs) with PacBio for capacity will be necessary -- and hard numbers on throughput/SMRT cell are desperately needed!
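The length-versus-passes relationship can be put on the back of an envelope. The sketch below is a crude estimate that treats polymerase lifetime as a fixed number of bases, whereas the real thing surely varies from molecule to molecule. The read length and adapter length are placeholders chosen for illustration, not numbers from the paper.

```python
def estimated_passes(read_length, insert_length, adapter_length):
    """Whole passes over the insert before the polymerase gives out.

    Each trip across the insert (in either orientation) costs the insert
    plus one hairpin adapter, so a full lap of the circle is two passes.
    """
    if insert_length <= 0 or adapter_length < 0:
        raise ValueError("insert_length must be positive, adapter_length non-negative")
    return read_length // (insert_length + adapter_length)


# Hypothetical values, for illustration only.
READ_LENGTH = 3500      # bases read before the polymerase dies (assumed)
ADAPTER_LENGTH = 50     # bases in one hairpin adapter (assumed)

for insert_length in (100, 250, 500, 1000, 2000):
    passes = estimated_passes(READ_LENGTH, insert_length, ADAPTER_LENGTH)
    print(f"{insert_length:>5} bp insert: ~{passes} passes")
```

Swap in real read-length figures once they are published and the same arithmetic tells you roughly what insert size buys a target number of passes.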

One side thought: is there a risk with strobe sequencing of accidentally going off the end and reading back along the same insert -- but not realizing it since the flip turn in the hairpin was done in the dark? It would suggest that libraries for strobe sequencing need to be very stringently sized.

One curiosity of their plot of consensus quality against number of passes is that the values appear to be asymptotically approaching Phred 40 -- an error rate of 1 in 10,000. Is this really where things top out? That's a good quality -- but for some applications possibly not good enough.

They go on to apply this to measuring the allele frequency of a SNP in defined mixtures. Given several thousand reads and the circular consensus, they succeeded at this -- even when the minor allele was present at a frequency of 2.5%. This isn't as far as some groups have pushed 2nd generation sequencing for rare allele detection (such as finding a cancer mutation in a background of normal DNA), but is certainly in the range of many schemes used in this context (such as real-time PCR). If my inference of a maximum Phred score is correct, it would tend to limit the sensitivity for rare mutation detection to somewhere in the parts-per-thousand range; the record is around this (0.1%, reported by Yauch et al.).
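One way to reason about that sensitivity limit is to ask how many minor-allele reads could turn up from sequencing error alone. The sketch below is my own framing, not the paper's analysis. It assumes errors are independent between reads, that a miscall lands on the specific minor allele one time in three, and that the Phred 40 ceiling (itself my inference) sets the per-base error rate. The read count, significance cutoff and function names are hypothetical.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space (0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return max(0.0, 1.0 - sum(binomial_pmf(i, n, p) for i in range(k)))


def min_reads_to_call(n_reads, error_rate, alpha=0.001):
    """Smallest minor-allele read count unlikely (< alpha) to arise by error."""
    p_false = error_rate / 3.0  # assumes miscalls are spread evenly over 3 bases
    k = 0
    while prob_at_least(k, n_reads, p_false) > alpha:
        k += 1
    return k


# Hypothetical experiment: 5,000 consensus reads at a Phred 40 ceiling.
N_READS = 5000
ERROR_RATE = 10 ** (-40 / 10)

threshold = min_reads_to_call(N_READS, ERROR_RATE)
print(f"need at least {threshold} minor-allele reads "
      f"(a frequency of {threshold / N_READS:.2%}) to beat the error floor")
```

The answer depends heavily on the assumed error model and read depth, so treat it as a way of framing the question rather than a prediction of what the instrument can do.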

My major concern with this paper is the lack of a breadth of substrates. Are there template properties which give trouble to the system? The obvious candidates are extremes of composition, secondary structures and simple repeats. Will their polymerase power through 80+% GC? Complex hairpins? Will stuttering be observed on long simple repeats? Can long mononucleotide runs be measured accurately?

In the end, the key uses for the first release PacBio system will depend on where these problems are and those throughput numbers. This paper is a useful bit of information, but much more is needed to determine when PacBio is either cost-effective for an application (vs. other sequencing or non-sequencing competitors) or when the performance advantages in that application (such as turn-around time) push them to the fore. Personally, I think it is a mistake on PacBio's part not to be literally flooding the world with data. Or at least the bioinformatics world -- only when they start releasing data to the broad developer community will there be the critical tweaking of existing tools and development of new tools to really take advantage of this platform and also cope with whatever weaknesses it possesses.

Travers, K., Chin, C., Rank, D., Eid, J., & Turner, S. (2010). A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research. DOI: 10.1093/nar/gkq543