The dataset from the Albertsen lab preprint (many thanks to Søren Karst for letting me know the new data had been posted) are both raw reads and the UMI-driven consensus sequences for E.coli MG1655. MG1655 has 236 homopolymers of 8 or longer -- 216 octameric repeats (206 of those A/T), 20 nonameric repeats (all but one A/T) and 1 single decameric repeat (surprisingly, a C/G repeat).
Our internal dataset is even farther from being perfect. In an ideal world I would have had a set of carefully designed test articles to explore. Instead, we ran some plasmids which were convenient and with well-validated reference sequences. Among the plasmids were a number of long A/T homopolymers; generally people avoid putting in long G/C homopolymers for fear of G-quartet structures.
Our internal dataset is even farther from being perfect. In an ideal world I would have had a set of carefully designed test articles to explore. Instead, we ran some plasmids which were convenient and with well-validated reference sequences. Among the plasmids were a number of long A/T homopolymers; generally people avoid putting in long G/C homopolymers for fear of G-quartet structures.
Now, of course, both sets of data are bit obsolete -- each run on R10.0 and now R10.3 is supposed to be slightly better. But it is still and interesting look in my opinion and also a call for ONT -- and everybody else in this space -- to broaden their scope when talking about homopolymers. For example, last year Genapsys ran E.coli as their test set and could report limited data on homopolymers longer than 6.
If we look at raw data, we see that sometimes the basecallers do quite well, but other times there are systematic errors. Here are plots with the deviation from correct length of the homopolymer on the X axis and fraction of the reads showing that as the Y. I've broken out for each one the purine and the pyrimidine as separate lines. They're all centered on 0. For example, here are all the octameric C/G repeats.
If we look at raw data, we see that sometimes the basecallers do quite well, but other times there are systematic errors. Here are plots with the deviation from correct length of the homopolymer on the X axis and fraction of the reads showing that as the Y. I've broken out for each one the purine and the pyrimidine as separate lines. They're all centered on 0. For example, here are all the octameric C/G repeats.
Many of these are shifted left -- calling a homopolymer short is by far the dominant error mode, though we can see some in which one base is consistently called correctly, such as the second from the left on the bottom row where C is usually gotten right -- but the G run on the opposite strand is quite off.
Here's the same sort of plot for the first 60 A/T octameric runs
Again, some work very well but there are some with consistent errors of one or both.
Here's all the E.coli nonamers -- since there's only one C/G I just plotted it with all the A/T
Here's the same sort of plot for the first 60 A/T octameric runs
Again, some work very well but there are some with consistent errors of one or both.
Here's all the E.coli nonamers -- since there's only one C/G I just plotted it with all the A/T
Again, variation -- presumably due to local context. T seems to lag (called consistently short) more than A, which is interesting.
There's only the one decamer -- and the basecalling did very poorly here with neither base showing a peak at 0 deviation.
For our internal dataset, we had many fewer homopolymer sites. But we do have an eleven!!
[2020-01-29 -- discovered I misspelled Tufnel in the title! Aaarrrggghhh!]
There's only the one decamer -- and the basecalling did very poorly here with neither base showing a peak at 0 deviation.
For our internal dataset, we had many fewer homopolymer sites. But we do have an eleven!!
Going back to the Albertsen group's data, we can ask how well can their pipeline, which uses Racon and Medaka, deal with this systematic error. The answer is: surprisingly well. If I align the UMI-based consensus reads to MG1655 and then look at homopolymers of 8 or longer, only three regions had less than half the consensus reads at the correct length. For one A/T octaamer and one A/T nonamer, only 1 of 3 consensus reads had the correct length; in each case both the other reads were one base short. At the C/G dodecamer, only 1 or 4 reads was of correct length; 2 were 1 short and one was 3 short.
So R10.0 seems to work remarkably well for consensus generation even on relatively long homopolymers, though there's still room for improvement. I'll renew my call for synthetic sequences to exhaustively probe performance in these problem regions. If we call the three bases before and three bases after a homopolymer the context, then there's only 2304 of each homopolymer type to make, Throwing in some padding and such, that's less than 200K of total test sequences to synthesize to cover all of a size class -- which these days really isn't an absurd amount of synthesis. Okay, the synthesis folks might gripe about those homopolymer runs, but clever approachs should be able to deal with these. Alternatively, one could look at other genomes to use as references and pick some with very skewed G+C content to get more long runs. Given how much nanopore basecalling and polishing depends on context and training of deep learning models, it would seem (in my opinion) critical to ensure we have a broad training set.
[2020-01-29 -- discovered I misspelled Tufnel in the title! Aaarrrggghhh!]
7 comments:
As revealed at their recent event, 10.0 was a preview version, 10,3 is the release with improved S;N. Surely analysis should be on the release not the beta.
One works with the data that is available. ONT's platform would make the Red Queen dizzy - waiting for stability is a fool's game.
My analytic code seems to be stable now & perhaps 1-2 refactorings away from something I'd let other people see. In any case, it's quick to rerun it if any other datasets show up or new R10.x basecallers are released
With apologies for asking about a trivial bit, but what does Tuftnellian mean?
I was wondering that also
It should have been Tufnelian -- I failed to check spelling. It's a poor joke of a reference to one of the characters in the mockumentary comedy This Is Spinal Tap -- Nigel Tufnel at one point explains to the interviewer that the band is so loud because their amps have volume knobs that "go to eleven". When asked why not just make ten louder, "these go to eleven"
Clip on YouTube
Named for Nigel Tufnel (Character from Spinal Tap) who thinks that an 11 on his amplifier means the music will be louder than if at 10.
Here's the Spinal Tap piece.
Post a Comment