Tuesday, October 26, 2021
PacBio has made its reputation delivering very high accuracy long reads, which it has branded HiFi. These are based on its circular consensus technology: each template DNA molecule is converted into a single continuous circle of DNA that can be read in a rolling circle reaction. The "movie" is converted to raw base calls and the adapters are clipped out, leaving "subreads" which can be aligned together to generate a circular consensus sequence (CCS) read. With many passes over the same molecule and its complement, the relatively high (~15%) error rate of the raw data can be brought down substantially using an HMM-based scheme. PacBio calls a read HiFi at a 1% error rate, but its model estimates an overall quality for each read, and that quality can keep getting better from there. Homopolymers still bedevil the technology, though not like they once did, and it turns out there is at least one more systematic error class.

Consensus building is a powerful way to cut through error. But could you do better? Two recent preprints from large tech companies, with PacBio co-authors, apply deep learning to this problem, and each comes up with the astounding result that they can do a bit over 40% better.
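To get a feel for why repeated passes cut through error, here is a toy calculation. It is emphatically not PacBio's HMM: it assumes a simple per-base majority vote, errors that are independent between passes, and the pessimistic case where every wrong call agrees with the other wrong calls. The function names `phred` and `majority_vote_error` are made up for this sketch. Real subread errors are dominated by indels and are not independent (homopolymers being the obvious example), so treat the numbers as illustration only.

```python
import math


def phred(error_rate: float) -> float:
    """Convert an error probability to a Phred-scaled quality (Q = -10 * log10(p))."""
    return -10.0 * math.log10(error_rate)


def majority_vote_error(passes: int, per_pass_error: float) -> float:
    """Probability that a per-base majority vote is wrong.

    Toy model (an assumption, not the real CCS algorithm): each pass is
    wrong independently with probability per_pass_error, and the vote
    fails whenever more than half of the passes are wrong.
    Use an odd number of passes so there are no ties.
    """
    if passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    p = per_pass_error
    needed = passes // 2 + 1  # smallest number of wrong passes that outvotes the rest
    return sum(
        math.comb(passes, k) * p**k * (1.0 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    raw_error = 0.15  # the ~15% raw error rate mentioned above
    for n in (1, 3, 5, 9, 15):
        err = majority_vote_error(n, raw_error)
        print(f"{n:2d} passes: error {err:.2e}, about Q{phred(err):.1f}")
```

Even this crude model shows the error rate dropping quickly as passes accumulate, which is the basic reason a noisy raw read can yield a consensus at the 1% (Q20) HiFi threshold or well beyond it. What it cannot capture is systematic error, where every pass tends to make the same mistake, and that is exactly where a smarter model has room to improve on plain voting.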