Pacific Biosciences long ago introduced the idea of Circular Consensus Sequencing (CCS), which they have recently rebranded as HiFi. PacBio libraries consist of SMRTbells: sheared fragments bounded by hairpins, so that the molecule becomes a closed single-stranded circle if denatured. As shown in the image swiped from some PacBio documentation, this means the insert (I) is bounded by stems (B) and loops (A). The sequence for non-barcoded adapters is below (I've inserted spaces to mark the loop structure).
In HiFi sequencing, the read can start anywhere, as the polymerase anneals in a loop but can start working before being observed. Let's imagine for simplicity we have a read that starts at base 0 of the insert I on the top strand. It goes through I and the Br adapter stem, traverses the Ar loop, then continues through Br' and I' and so forth. The PacBio software takes this initial polymerase read, finds the adapters and removes them to generate a series of subreads which should be in the order I-I'-I-I'-I etc. -- a forward version of the insert alternating with a reverse-complement version of the insert (or vice versa; I'm arbitrarily calling the first pass forward). Combining the information from all those passes enables generating a HiFi read of high accuracy, with each read coming with an overall estimated error rate.
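To make the expected I-I'-I-I' pattern concrete, here is a minimal toy sketch in Python. It is not PacBio's software: the adapter string `TOY_ADAPTER`, the insert, and the helper names are all made up for illustration, and the adapter is found by exact string match, whereas a real subread splitter has to tolerate sequencing errors.

```python
# Toy model of a polymerase read going around a SMRTbell.
# TOY_ADAPTER is a hypothetical stand-in (stem, loop, stem'), NOT the real
# PacBio adapter sequence.
TOY_ADAPTER = "GGGGGG" + "AAAA" + "CCCCCC"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def polymerase_read(insert, adapter, passes):
    """Build an idealized, error-free polymerase read with `passes` passes."""
    parts = []
    for i in range(passes):
        # Even passes read the insert forward, odd passes read the other strand.
        parts.append(insert if i % 2 == 0 else revcomp(insert))
        parts.append(adapter)
    return "".join(parts)


def split_subreads(read, adapter):
    """Split on exact adapter matches (real tools must align with errors)."""
    return [piece for piece in read.split(adapter) if piece]


insert = "ACGTTGCAAGGCTA"
read = polymerase_read(insert, TOY_ADAPTER, passes=5)
subreads = split_subreads(read, TOY_ADAPTER)

for n, sub in enumerate(subreads):
    strand = "forward" if sub == insert else "reverse"
    print(n, strand, sub)
```

In this idealized case every subread is either the insert or its reverse complement, strictly alternating, and no single subread ever contains both orientations.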
In order to assess HiFi sequencing, we recently had a third party sequence some samples. One of these samples was a plasmid cloned in a standard E. coli host. We see the same pattern both for the plasmid of interest and the inevitable annoying E. coli background: lots of HiFi reads with varying numbers of passes across the insert and therefore varying consensus quality. Just what we expect.
Except, we also see a bunch of quirky reads in which the HiFi read contains an insert immediately followed by its reverse complement! Now, it would hardly be surprising if bad luck occasionally meant a SMRTbell adapter was missed because the sequence quality went south there. So I wouldn't be shocked if a read with only two passes had this issue. But more than 1400 of these reads have many passes, and in none of their subreads is the expected 45 basepair adapter present, yet they still show this structure of a sequence immediately followed by its reverse complement. Huh??????
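A read made of an insert immediately followed by its reverse complement is, as a whole, equal to its own reverse complement. That suggests a crude screen, sketched below in Python. This is only an illustration, not how I found the reads: the function names and the 0.9 threshold are arbitrary assumptions, and the position-by-position comparison ignores indels, so real reads would need an alignment-based version.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_revcomp_identity(read):
    """Fraction of positions at which a read matches its own reverse complement.

    A perfect insert-plus-reverse-complement read scores 1.0. This is a
    Hamming-style comparison, so a single indel will wreck the score.
    """
    if not read:
        return 0.0
    rc = revcomp(read)
    matches = sum(a == b for a, b in zip(read, rc))
    return matches / len(read)


def looks_like_boomerang(read, min_identity=0.9):
    """Flag reads that are nearly their own reverse complement.

    min_identity is an arbitrary illustrative cutoff, not a tuned value.
    """
    return self_revcomp_identity(read) >= min_identity


insert = "ACGTTGCAAGGCTA"
boomerang = insert + revcomp(insert)

print(looks_like_boomerang(insert))
print(looks_like_boomerang(boomerang))
```

The same check also explains why the repeat looks neatly centered: a read only equals its own reverse complement if the junction sits exactly at the midpoint.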
None of the possible explanations I've devised for these "boomerangs" is very satisfying. Obviously perfect inverted repeats exist in genomes, but none of these reads appears to correspond to one. In addition, the repeats are all neatly centered in the insert, which would be remarkable for a genomic feature. It's difficult to imagine a molecular biological side-reaction in the library prep that would generate these; again, the precision of the pattern is difficult to explain. Software is always a candidate scapegoat, but why would the PacBio subread splitting software possibly perform such a perverse trick and so consistently -- and in only a small subset of reads?
I've wangled permission to dump a subset of 321 CCS reads and their corresponding subreads into the public sphere on GitHub: https://github.com/krobison13/201908-PacBio-CCS-Boomerangs .