Friday, August 23, 2019

Mirror, Mirror Have You Leads, On My Strange HiFi Reads?

A core purpose of this space is to explore the current state of genomics technology.  Much of the time this is via distilling news reports, press releases and interviews with persons in the field.  But even more fun is to dive into actual data.  Such data is often accessed via the generosity of researchers who deposit open access datasets.  But it is also true that part of my professional responsibilities is to determine when new technologies and methods have applications at my day job, so I'm also charged in the day job with contemplating experiments to plumb sequencing systems.  Only by doing so can we ensure that we maximize our ability to perform cutting-edge synthetic biology.  A recent such experiment generated some curious data which I have obtained permission to share a subset of it publicly, as I'm scratching my head and hope that someone out in the community has insight.
Pacific Biosciences long ago introduced the idea of Circular Consensus Sequence (CCS), which they have recently rebranded as HiFi.  PacBio libraries consist of SMRTbells, sheared fragments bounded by hairpins to form a closed circular single strand if denatured.  As shown in the image swiped from some PacBio documentation, this means the insert (I) is bounded by stems (B) and loops (A).  The sequence for non-barcoded adapters is below (I've inserted spaces to mark the loop structure
5’ – pATCTCTCTC TTTTCCTCCTCCTCCGTTGTTGTTGTT GAGAGAGAT – 3’


In HiFi sequencing, the read can start anywhere, as the polymerase anneals in a loop but can start working before being observed.  Let's imagine for simplicity we have a read that starts at base 0 of the insert I on the top strand.  It then goes through I and the Br adapter, traverses the Ar loop and then through the Br' and I' and so forth.  The PacBio software takes this initial polymerase read and finds the adapters and removes them to generate a series of subreads which should be in the order I-I'-I-I'-I etc. -- a forward version of the insert alternating with a reverse version of the insert (or vice versa; I'm arbitrarily calling the first pass forward).  Combining the information from all those passes enables generating a HiFi read of high accuracy, with each read coming with an overall error rate.

In order to assess HiFi sequencing, we recently had a third party sequence some samples.  One of these samples was a plasmid cloned in a standard E.coli host.  We see the same pattern both for the plasmid of interest and the inevitable annoying E.coli background: lots of HiFi reads with varying number of passes across the insert and therefore varying consensus quality.  Just what we expect.

Except, we also see a bunch of quirky reads in which the HiFi read contains an insert immediately followed by its reverse complement!  Now, it would hardly be surprising if bad luck occasionally meant a SMRTbell was missed due to the sequence quality going south there. So I wouldn't be shocked if a read with only two passes had this issue.  But more than 1400 of the reads have many passes, and in no subread is the expected 45 basepair adapter present yet have this sequence immediately followed by reverse complement structure.  Huh??????

None of the possible explanations I've devised for these "boomerangs" are very satisfying.  Obviously perfect inverted repeats exist in genomes, but none of these appear to correspond to such.  In addition, they're all neatly centered in the insert, which would be remarkable.  It's difficult to imagine a molecular biological side-reaction in the library prep that would generate these; again the precision of the pattern is difficult to explain.  Software is always a candidate scapegoat, but why would the PacBio subread splitting software possibly perform such a perverse trick and so consistently -- and in only a small subset of reads?

I've wangled permission to dump a subset of 321 CCS reads and their corresponding subreads into the public sphere on GitHub: https://github.com/krobison13/201908-PacBio-CCS-Boomerangs .  



6 comments:

BR said...

reposted from tweet for future reference:

@OmicsOmicsBlog @PacBio Hi Keith! This looks like molecular *missing* adapters as opposed to algorithmically *missed* adapters. Increasing the adapter conc should help. But don’t take my word for it; the quickest way to get help on something like this is to email our tech support, support@pacb.com.

gasstationwithoutpumps said...

If the end of the insert is a bit denatured, couldn't the ligation reaction join one strand to the other, without adding an adapter? Does the ligation reaction really require blunt-end double-stranded DNA, or can it sometimes join single-stranded DNA?

Jane M Landolin said...

Keith,

See Billy's response (@nothingclever)
https://twitter.com/nothingclever/status/1164927629742358528

"Hi Keith! This looks like molecular *missing* adapters as opposed to algorithmically *missed* adapters. Increasing the adapter conc should help. But don’t take my word for it; the quickest way to get help on something like this is to email our tech support, support@pacb.com."

PacBio is happy to help.



Jane M Landolin said...

Hello Keith,
PacBio Tech Support here. This is a known issue for certain samples that has been resolved with enhanced adapter finding in Sequel 8M.

There are a few options to work around missing adapter molecules in existing datasets.

If you have the subreads.bam and scraps.bam [preferred]:
1. Run https://anaconda.org/bioconda/recalladapters. `recalladapters --adapters adapters.fasta in.subreads.bam in.scraps.bam -o out`. The (relatively new) "adapter correction" logic identifies the perfectly-centered inverted repeats in multi-pass reads that you describe and splits such subreads in the center. Today, the adapter correction logic is activate by default on the Sequel II System but not the Sequel System.

If you only have the subreads.bam:
2. If you have a tight fragment size distribution, set `--maxLength` to 1.5x your average fragment size when running CCS. This will avoid producing CCS reads for missing adapter molecules.

If you only have the ccs.bam / ccs.fastq:
3. If you have a tight fragment size distribution, discard reads that are more than 1.5x the average fragment size.

Jane M Landolin said...

Hello Keith,
PacBio Tech Support here. This is a known issue for certain samples that has been resolved with enhanced adapter finding in Sequel 8M.

There are a few options to work around missing adapter molecules in existing datasets.

If you have the subreads.bam and scraps.bam [preferred]:
1. Run https://anaconda.org/bioconda/recalladapters. `recalladapters --adapters adapters.fasta in.subreads.bam in.scraps.bam -o out`. The (relatively new) "adapter correction" logic identifies the perfectly-centered inverted repeats in multi-pass reads that you describe and splits such subreads in the center. Today, the adapter correction logic is activate by default on the Sequel II System but not the Sequel System.

If you only have the subreads.bam:
2. If you have a tight fragment size distribution, set `--maxLength` to 1.5x your average fragment size when running CCS. This will avoid producing CCS reads for missing adapter molecules.

If you only have the ccs.bam / ccs.fastq:
3. If you have a tight fragment size distribution, discard reads that are more than 1.5x the average fragment size.

Yu said...

It is likely that the SMRT bell adapters were ligated to one end of the these fragments, but something else were ligated to the other ends. So when you run CCS generation, one pass is actually fwd + rev, instead of fwd or rev.