After thinking about my previous entry on PacBio's sample prep and variant sensitivity papers, I realized there was a significant gap -- neither dealt with gaps.
Gaps, or indels, are very important. Not only are indels a significant form of polymorphism in genomes, but small indels are one route by which tumor suppressor genes can be knocked out in cancer.
Small indels have also tended to be troublesome in short-read projects -- one cancer genome project missed a known indel until the data was reviewed manually. Unfortunately, no further details were given, such as the depth of coverage obtained or whether the problem lay in the alignment phase or in recognizing the indel. The Helicos platform (perhaps currently looking like it should have been named Icarus) had a significant false indel rate due to missed nucleotide incorporations ("dark bases"). Even SOLiD, whose two-base encoding should in theory enable highly accurate base calling, achieved only a 25% indel verification rate in one study (all of these are covered in my review article).
Since PacBio, like Helicos, is a single-molecule system, missed incorporations are expected to be a problem. Some of these will perhaps be due to limits in the optical scheme, but many will probably be due to contamination of the nucleotide mixes with unlabeled nucleotides. This isn't a knock on PacBio -- it's just that any unlabeled nucleotide contamination (whether from a failure to label or later loss of the label) will be trouble. Even at 1 part per million purity, a given run will still see tens or hundreds of incorporations of unlabeled nucleotides.
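To make that arithmetic concrete, here is a back-of-envelope sketch. The run parameters are made-up illustrative numbers, not PacBio specifications:

```python
# Back-of-envelope estimate of "dark" (unlabeled) incorporations per run.
# All numbers below are hypothetical, for illustration only.
contamination = 1e-6        # 1 part per million unlabeled nucleotide
reads_per_run = 100_000     # assumed number of reads in a run
mean_read_length = 1_000    # assumed mean read length in bases

total_incorporations = reads_per_run * mean_read_length  # 1e8 bases
expected_dark = total_incorporations * contamination

print(expected_dark)  # 100.0 -- ~100 silent incorporations per run
```

Even at purity levels far beyond what reagent chemistry usually delivers, the sheer number of incorporations per run keeps the expected count well above zero.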
There's another interesting twist to indels -- how do you assign a quality score to them? After all, a typical phred score encodes the probability that the called base is wrong. But what if no base should have been called? Or another base should have been called prior to this one? For large numbers of reads, you can calculate some sort of probability of seeing the same event by chance. But for small numbers, can you somehow sort the true from the false? One stab I (and I think others) have made at this is to give an indel a pseudo-phred score derived from the phred scores of the flanking bases -- the logic being that if those were strong calls, there probably isn't a missing base, but if those calls were weak, then your confidence in not having skipped a base is poor. The function to use on those adjacent bases is a matter of taste & guesswork (at least for me) -- I've tried averaging the two or taking the minimum (or even computing over longer windows), but have never benchmarked things properly.
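The flanking-base heuristic above can be sketched in a few lines. This is just my unbenchmarked guesswork made explicit -- the function name and the "min"/"mean" modes are my own labels, not any standard:

```python
def indel_pseudo_phred(left_qual, right_qual, mode="min"):
    """Assign a pseudo-phred score to an indel from the phred scores
    of its two flanking base calls.

    mode="min" is the conservative choice (an indel is only as trustworthy
    as its weakest neighbor); mode="mean" averages the two. Both are
    untested heuristics, as discussed in the text.
    """
    if mode == "min":
        return min(left_qual, right_qual)
    if mode == "mean":
        return (left_qual + right_qual) / 2
    raise ValueError(f"unknown mode: {mode!r}")

# e.g. a putative deletion flanked by a Q30 and a Q12 call:
q_min = indel_pseudo_phred(30, 12)           # 12
q_mean = indel_pseudo_phred(30, 12, "mean")  # 21.0
```

A longer-window variant would simply pass more flanking qualities into the same kind of summary function; the choice of summary is exactly the "taste & guesswork" part.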
Some variant detection algorithms for short-read data deal with indels (e.g. VarScan) and some don't (e.g. SNVMix). Ideally they all would. There's also a nice paper on the subject that unfortunately didn't provide actual code for lazy folks like me.
So, in conclusion, I'd love to see PacBio (and anyone else introducing a new platform) perform a similar analysis of their circular consensus sequencing on a small amount of a 1-nucleotide indel variant spiked into varying (but large) backgrounds of wildtype sequence. Indeed, ideally there would be a standard set of challenges which the scientific community would insist that every new sequencing platform either publish results on or confess inadequacy for. As I suggested in the previous post, extremes of %GC, hairpins, and mono/di/tri/tetranucleotide repeats should be included, along with the single nucleotide substitution and single nucleotide deletion mixtures. To me these would be far more valuable in the near term than the sorts of criteria in the sequencing X-prize for exquisite accuracy. Those are some amazing specs (which I am on record as being skeptical will be met anytime soon). What those of us trying to match platforms with experiments (and budgets) really need are the nitty-gritty details of what works and what doesn't, at what cost (or coverage).
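As a sketch of why the spike-in challenge and coverage go hand in hand: under a simple binomial sampling model (my assumption, not anything from the papers discussed), the chance of even seeing a rare variant at a site depends directly on depth. The function and parameters below are illustrative only:

```python
from math import comb

def prob_at_least_k(coverage, fraction, k):
    """Probability of sampling at least k variant-bearing reads at a site,
    assuming reads are drawn independently from a mixture in which a
    'fraction' of molecules carry the variant (simple binomial model)."""
    return 1.0 - sum(
        comb(coverage, i) * fraction**i * (1 - fraction)**(coverage - i)
        for i in range(k)
    )

# e.g. a 1% indel spike-in at 1000x coverage, demanding >= 3 supporting reads:
p = prob_at_least_k(1000, 0.01, 3)  # ~0.997 under these assumptions
```

This ignores everything that actually makes indels hard (alignment failures, dark bases, error clustering), which is exactly why the empirical spike-in experiment, not the idealized model, is what each platform should publish.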