So on New Year's Eve I started getting the dataset together and reducing it to a bunch of dataframes, and today I pushed that a bit further and started graphing some of it. It's very much a rough project -- some of the dataframes have some issues I'm still chasing down with redundant data not being initially collapsed, but I think the data is accurate. I also think I have my conventions consistent -- at one point confused myself into inverting the labels on the plots! In other words, ApG would be labeled GpA -- not good! There's already some intriguing patterns, which are presumably the sort of signal tools like Medaka use to polish assemblies from FASTQ data aligned to draft references.
The dataset is about 65K bacterial 16S amplicons aligned to a control 16S RNA, generated with R9.4.1 flowcells. For each aligned position, I've captured not only the reference (truth) base, the called base and the Q-score, but also these statistics for the next aligned position. The reads were called with MinKNOW using the most up-to-date basecaller.
My key bit of curiosity is around a major Nanopore error mode: a missed base relative to the reference. This also gets into a challenge with Q-scores -- a Q-score is an estimate that you called a base correctly, on a log10 score. Every called base gets a Q-score -- but uncalled bases don't have a Q-score. So what I wanted to look at is to what degree the Q-score of a called base is predictive of the accuracy of the next base.
Given that I started with only fuzzy ideas of how to approach this, I took two major stabs at it. I think both are interesting enough to put here in hopes they stimulate others, but neither is remotely a story complete enough to even dream of writing a pre-print.
Stab 1: Given Reference Sequence & Quality, How Good is Calling Next Base?
So here is a gallery of plots for each reference dinucleotide, showing the cases in which the nanopore base is correct for the first position and wrong in the second. X-axis is Q-score, Y-axis is frequency. Each series is for the next aligned nanopore base or a gap. The cyan dotted line shows X=Y once you convert Q-scores to frequencies (but remember, we're plotting errors at N+1 vs. Q-score for position N)
This is more of a gestalt image, but you can see some differences which can be zoomed in on. For example, CC seems to have a high rate of miscalling A at the second C, even at high Q-score values for the first C
Contrast this with TC which has a high rate of gapping instead of calling C
Or with GC, which seems to be miscalling T at the second position even after a high confidence G call
Stab 2: Given Two Bases on Read, What Is Probability of a Gap?
In the above I used reference NpN sequences, but of course for a read we generally don't know the reference. What if we look at reads instead?
So I generated a new dataframe that focuses on base N, base N+1 and then after alignment to the reference what is the distance between those two, with the cyan dotted line at the expectation from Phred scoring. The X here is the minimum quality between the two, on the logic that the worst called base is a reasonable proxy for the probability of one or more bases failing to be called.
Here are the frequencies observed for deletions of 1, 2 or 3 bases. It does look like the minimum score works reasonably well, but many of the curves have a hump for deletion length 1 around scores of 20-30: apparently deletions are more commonly seen here then we would expect. The hump is far more prominent for some -- AC, AT, AG and CG have a very prominent hump whereas CA and GA do not
What does it all mean? I haven't figured any of that out. The nanopore basecallers will change again before long, next time to Bonito. If I recall correctly, Bonito doesn't generate Q-scores so this analysis wouldn't be possible So far it and I have had a troubled relationship -- in early-December I decided to re-call a different dataset with Bonito and spent a good part of a day struggling to get an AWS GPU instance correctly configured to run Bonito -- as well as attempts to install it on several local machines -- and then re-called all my data only to get lousy results. Apparently if you use Bonito with an R9.4.1 model on R10.3 data, the results are awful -- who would have guessed? I had simply forgotten that the datasets were R10.3; we normally run only R9.4.1 but there was a shipping error by ONT and we ran with what we had. Only after that misadventure did ONT release R10.3 models for Bonito, so I am tempted to re-attempt this and see if I have truly learned how to set up the GPU instances or I will flail away again.
It does occur to me that this exercise is well within the range of an undergraduate student, and there are certainly publicly available Nanopore datasets for E.coli or the Zymo Mock Community which could be aligned and a subset used to generate the sort of data above.