Monday, May 06, 2019

Genapsys' Base Caller: Mysterious, But Not Ideal?

When I wrote about Genapsys' pre-print on their sequencing system the other night, I intended that to be the last I wrote until some major news from them.  But after launching that into the great Internet ether,  I found myself lying awake wondering if a very simple idea had any merit.  Painfully simple -- I almost didn't pursue it because it was so simple and obvious.  But, it turns out it appears to have merit -- there may be an obvious route to improving the accuracy of Genapsys' basecalling on homopolymers.  And that also took me into ground I've thought about before -- going back to my first year at Codon Devices over a decade ago -- on the challenge of modeling sequence quality.

To review: Genapsys has a preprint which is a bit terse on details -- except when they are absent.  Actually, the basecaller falls into that latter category -- described only as their proprietary pipeline, with no hint to even the class or classes of algorithms used.  Their preprint doesn't dive deeply into particular types of errors, but I did and explored the homopolymer calling using the counterr tool.  For homopolymeric reference sequences of up to 6, their caller gets it right -- if the sequence is really 6 in a row of something it will come out as 6 of that.  However, as I realized after starting this new pursuit, the converse is not true -- if the caller gives 6 it could definitely be a 7-base homopolymer or even longer.  Indeed, significant number of 7s will be called as 5.

The painfully obvious idea was to see if there is any dependence of homopolymer miscalls based on the position in the read.  I've also been learning pysam recently for a work project, so I had familiarity with the right tool to attack the problem.  So I've generated plots based on the error rate based on knowing the called length of the homopolymer (top) or the reference length of the homopolymer (bottom)


As you can see, in each case the error rate goes up as we move along the read.  So there appears to a signal to be exploited to improve homopolymer calling.  It is more difficult with the real situation -- for example 6-base homopolymer calls are dominated by the frequent and correct actual 6-mers with only a bit of invasion by the called-short 7 and greater runs.  Still, an awful lot of machine learning is successfully applied to much more subtle signals with far less available training data.

Ideally this would be in the basecaller.  I'm really curious now what this proprietary caller is.  Given the current age, I would have imagined building some sort of machine learning model that takes the signal prior a flow and the delta of the signal and trains on getting the correct number of bases as the output. That could, of course, be further divided by which base is being called -- train four different models.

I'm a bit annoyed and grateful that Genapsys chose to make public only FASTQ dataset (still pure annoyed it's only E.coli) -- annoyed because if they had released raw signal values I'd be able to try building such a model myself and grateful because if they had released such data I'd be immersing myself in another side project I don't really need.

Alternatively, one could imagine a polishing module or a specialized component of a consensus caller or variant caller that would incorporate the length dependence -- and the error profile -- into calls.  For example, the data suggests that the system when it errs essentially always goes short.  So if there are early-in-the-read base calls of 7 and lots of 6s, that homopolymer is at least 7 -- because 6 is never called as 7.

Genapsys uses native, unterminated nucleotides and such chemistries -- 454 and Ion Torrent -- have a reputation for homopolymer issues.  If I were at Genapsys I'd be trying to put my absolute best foot forward in this department.  Indeed, their AGBT presentation back in 2014 made some strong claims in this department (if memory serves, it was that their signal was linear for up to 20 of the same nucleotide).  So I'd have poured a lot of effort into exploring different calling models -- that there seems to be a clear read-position signal in the errors suggests that they did not do the same.

One of my early projects at Codon was to devise rules for combining phred scores and choosing thresholds for declaring a construct confirmed.  Phred values are an elegant system for encoding the probability that the called base is correct -- but I realized that it doesn't really model the issue of should a base have been called but wasn't?  In other words, how confident are we that the current base transitions to the next base or not?  With Sanger sequencing, which is what I was working with then, there is a somewhat-reasonable idea that if your raw phred scores are low then you are far less confident you don't have indels and conversely high scores mean a low probability of missed bases.  But this doesn't really hold for other platforms.  Ideally there'd be a way to encode all that, but it would yield a ton of data that probably most people (sadly, probably including myself) would generally find too much trouble to deal with.

How much could Genapsys improve on their basecalling performance?  Without performing the exercise, it's impossible to know.  There's also the possibility I've made some trivial error that explains these plots.  But the shape of the curves suggests this isn't the case -- if the signal were only at the very extreme of read length that might apply but it really does appear to be a trend that runs the entire way.  If you're going to try to crack into the crowded sequencing market, you don't want to miss a single detail, especially one related to a potential opinion bias against your platform from the get-go. 

[2019-05-07 -- fixed axes labels for plots -- Incorrect not Correct! ]


1 comment:

Anonymous said...

This is likely artifact of their detection modality. Genapsys claims to the contrary notwithstanding, their accuracy absolutely depends on the read position. The signal they measure is a change of capacitance on the background of the capacitance of the already synthesized strand. Their sensors are not small enough to measure the cap of the PTJ only. The same effect likely limits their read length too, despite what they may claim. Overall, Genapsys' trqck record of over promising and underdelivering makes ONT look like you could take their release promises to the bank :)