Pacific Biosciences long ago introduced the idea of Circular Consensus Sequence (CCS), which they have recently rebranded as HiFi. PacBio libraries consist of SMRTbells, sheared fragments bounded by hairpins to form a closed circular single strand if denatured. As shown in the image swiped from some PacBio documentation, this means the insert (I) is bounded by stems (B) and loops (A). The sequence for non-barcoded adapters is below (I've inserted spaces to mark the loop structure
In HiFi sequencing, the read can start anywhere, as the polymerase anneals in a loop but can start working before being observed. Let's imagine for simplicity we have a read that starts at base 0 of the insert I on the top strand. It then goes through I and the Br adapter, traverses the Ar loop and then through the Br' and I' and so forth. The PacBio software takes this initial polymerase read and finds the adapters and removes them to generate a series of subreads which should be in the order I-I'-I-I'-I etc. -- a forward version of the insert alternating with a reverse version of the insert (or vice versa; I'm arbitrarily calling the first pass forward). Combining the information from all those passes enables generating a HiFi read of high accuracy, with each read coming with an overall error rate.
In order to assess HiFi sequencing, we recently had a third party sequence some samples. One of these samples was a plasmid cloned in a standard E.coli host. We see the same pattern both for the plasmid of interest and the inevitable annoying E.coli background: lots of HiFi reads with varying number of passes across the insert and therefore varying consensus quality. Just what we expect.
Except, we also see a bunch of quirky reads in which the HiFi read contains an insert immediately followed by its reverse complement! Now, it would hardly be surprising if bad luck occasionally meant a SMRTbell was missed due to the sequence quality going south there. So I wouldn't be shocked if a read with only two passes had this issue. But more than 1400 of the reads have many passes, and in no subread is the expected 45 basepair adapter present yet have this sequence immediately followed by reverse complement structure. Huh??????
None of the possible explanations I've devised for these "boomerangs" are very satisfying. Obviously perfect inverted repeats exist in genomes, but none of these appear to correspond to such. In addition, they're all neatly centered in the insert, which would be remarkable. It's difficult to imagine a molecular biological side-reaction in the library prep that would generate these; again the precision of the pattern is difficult to explain. Software is always a candidate scapegoat, but why would the PacBio subread splitting software possibly perform such a perverse trick and so consistently -- and in only a small subset of reads?
I've wangled permission to dump a subset of 321 CCS reads and their corresponding subreads into the public sphere on GitHub: https://github.com/krobison13/201908-PacBio-CCS-Boomerangs .
reposted from tweet for future reference:
ReplyDelete@OmicsOmicsBlog @PacBio Hi Keith! This looks like molecular *missing* adapters as opposed to algorithmically *missed* adapters. Increasing the adapter conc should help. But don’t take my word for it; the quickest way to get help on something like this is to email our tech support, support@pacb.com.
If the end of the insert is a bit denatured, couldn't the ligation reaction join one strand to the other, without adding an adapter? Does the ligation reaction really require blunt-end double-stranded DNA, or can it sometimes join single-stranded DNA?
ReplyDeleteKeith,
ReplyDeleteSee Billy's response (@nothingclever)
https://twitter.com/nothingclever/status/1164927629742358528
"Hi Keith! This looks like molecular *missing* adapters as opposed to algorithmically *missed* adapters. Increasing the adapter conc should help. But don’t take my word for it; the quickest way to get help on something like this is to email our tech support, support@pacb.com."
PacBio is happy to help.
Hello Keith,
ReplyDeletePacBio Tech Support here. This is a known issue for certain samples that has been resolved with enhanced adapter finding in Sequel 8M.
There are a few options to work around missing adapter molecules in existing datasets.
If you have the subreads.bam and scraps.bam [preferred]:
1. Run https://anaconda.org/bioconda/recalladapters. `recalladapters --adapters adapters.fasta in.subreads.bam in.scraps.bam -o out`. The (relatively new) "adapter correction" logic identifies the perfectly-centered inverted repeats in multi-pass reads that you describe and splits such subreads in the center. Today, the adapter correction logic is activate by default on the Sequel II System but not the Sequel System.
If you only have the subreads.bam:
2. If you have a tight fragment size distribution, set `--maxLength` to 1.5x your average fragment size when running CCS. This will avoid producing CCS reads for missing adapter molecules.
If you only have the ccs.bam / ccs.fastq:
3. If you have a tight fragment size distribution, discard reads that are more than 1.5x the average fragment size.
Hello Keith,
ReplyDeletePacBio Tech Support here. This is a known issue for certain samples that has been resolved with enhanced adapter finding in Sequel 8M.
There are a few options to work around missing adapter molecules in existing datasets.
If you have the subreads.bam and scraps.bam [preferred]:
1. Run https://anaconda.org/bioconda/recalladapters. `recalladapters --adapters adapters.fasta in.subreads.bam in.scraps.bam -o out`. The (relatively new) "adapter correction" logic identifies the perfectly-centered inverted repeats in multi-pass reads that you describe and splits such subreads in the center. Today, the adapter correction logic is activate by default on the Sequel II System but not the Sequel System.
If you only have the subreads.bam:
2. If you have a tight fragment size distribution, set `--maxLength` to 1.5x your average fragment size when running CCS. This will avoid producing CCS reads for missing adapter molecules.
If you only have the ccs.bam / ccs.fastq:
3. If you have a tight fragment size distribution, discard reads that are more than 1.5x the average fragment size.
It is likely that the SMRT bell adapters were ligated to one end of the these fragments, but something else were ligated to the other ends. So when you run CCS generation, one pass is actually fwd + rev, instead of fwd or rev.
ReplyDeleteHi guys we work a lot on rolling circle amplification and the fact that phi29 makes long run offs it sometimes strand jumps and then you you get inverted reads to what is expected. this is an inherent problem with these enzymes like Bst and phi29. The theory, off the top of my head, is that when the peeled of strand gets larger the proximity to the amplified strand is close and at some point the likelihood of the pool jumping strands increases. Look at link for an example https://pubs.rsc.org/image/article/2015/AN/c4an01711k/c4an01711k-s1_hi-res.gif
ReplyDelete