I'll confess to being very shallow in my understanding of deep learning -- but we've clearly hit a wonderful conjunction of machine learning advances and biological data abundance, and these methods are making strong headway in numerous areas. Using deep learning for basecalling and consensus improvement (as well as modification calling) has long been practiced at PacBio rival Oxford Nanopore. I'll go light on the details of the deep learning models, as I don't feel confident I can do them justice, but dig a bit deeper on things more in my comfort zone.
One preprint (Lai et al) is from NVIDIA (with one author also holding a position at USC) and PacBio; the other (Baid et al) is from Google and PacBio. I'll refer to each by the first author and company, as the author should get credit but my brain holds onto the companies better. Both groups have built HiFi CCS polishers -- they take subreads and an already-built CCS read and try to build a better CCS read, rather than building a CCS read from scratch. Since the CCS reads are already very good, this makes sense.
The Lai/NVIDIA paper has a bit less to it, and the code they make available doesn't include pre-built models, but it is still interesting on its own and then more interesting in context with Baid/Google. Each position in the alignment ("pileup") is encoded as a tensor with the following elements: the count of each of A, C, G, T on each strand (8 elements), the number of insertions on each strand (2 elements), the consensus base in a one-hot encoding (5 elements, one each for A, C, G, T, and '-'), and the consensus quality -- 16 in total. I'm a bit confused as to why only insertions are encoded and not deletions; since deletions are a dominant error mode in the raw subreads perhaps this makes sense, but some explanatory text on that point would improve the manuscript. The pileups thus encoded are then broken into 1,000-column chunks, with 200-column overlaps between adjacent chunks.
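To make that encoding concrete, here's a minimal sketch of how one pileup column might be packed into the 16-element vector and how the overlapping chunking could work. The function names and input representations are mine, not from the paper's code -- just an illustration of the scheme as described.

```python
import numpy as np

BASES = "ACGT"

def encode_column(fwd_counts, rev_counts, fwd_ins, rev_ins,
                  consensus_base, consensus_qual):
    """Pack one pileup column into a 16-element vector: 8 strand-specific
    base counts, 2 strand-specific insertion counts, a 5-way one-hot for
    the draft consensus base (A, C, G, T, or '-'), and the consensus
    quality."""
    vec = np.zeros(16, dtype=np.float32)
    vec[0:4] = [fwd_counts.get(b, 0) for b in BASES]
    vec[4:8] = [rev_counts.get(b, 0) for b in BASES]
    vec[8], vec[9] = fwd_ins, rev_ins
    vec[10 + "ACGT-".index(consensus_base)] = 1.0
    vec[15] = consensus_qual
    return vec

def chunk_pileup(columns, width=1000, overlap=200):
    """Break the encoded pileup into 1,000-column chunks that overlap
    their neighbors by 200 columns, as the paper describes."""
    step = width - overlap
    return [columns[i:i + width]
            for i in range(0, max(1, len(columns) - overlap), step)]
```

The 200-column overlap presumably gives the model full context at chunk edges, with the overlapping predictions reconciled downstream.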
The authors do mention that they tried a complete one-hot encoding of the entire pileup, but didn't make progress on this. Figuring out how to convert complex data into features that can be trained on does seem to be one of the key bits of challenge in machine learning.
The figure below shows the improvement in read accuracy vs. the number of subreads. It is curious that they chose to lump everything over 23 subreads into one bin, since that bin showed the greatest effect. It would be very interesting from a planning perspective to understand how the curve continues, because the number of subreads is roughly a function of insert length and the amount of instrument time used to collect data. If you really want to turn over the instrument to get more runs in (which means using more PacBio consumables, something the company would like you to do!), you might want to run shorter movies -- but how much shorter? Conversely, to push HiFi to greater lengths one would like to know how much longer inserts (and therefore fewer subreads) would affect quality. It's basically the same question, but as a user you have two levers that might alter the number of subreads going into each CCS read.
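The two levers connect through a simple back-of-envelope relation: passes ≈ movie time × polymerase speed ÷ insert length. A toy version, where the default speed (a couple of bases per second) and the assumption that the polymerase survives the whole movie are my guesses, not figures from either paper:

```python
def expected_subreads(movie_hours, insert_bp, pol_speed_bp_per_s=2.0):
    """Rough expected number of passes over an insert: total bases the
    polymerase can read in the movie, divided by insert length.
    Assumes the polymerase runs the full movie -- optimistic."""
    return (movie_hours * 3600 * pol_speed_bp_per_s) / insert_bp
```

Under these assumptions a 30-hour movie on a 15 kb insert yields around 14 passes, and stretching to a 24 kb insert drops that to about 9 -- which is why knowing how accuracy behaves well above 23 subreads matters for choosing movie lengths.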
There's some discussion of different model architectures attempted and a supplementary table compares the results. I won't try to dissect that bit for fear of mangling it badly.
One interesting note is that normalizing the base counts to sum to 1, rather than using raw counts, leads to slightly worse performance. I suppose that isn't very surprising, since such a normalization obscures the overall confidence in the estimates. Perhaps some other function for transforming the raw counts would lead to interesting results -- something that downweights small numbers of counts but flattens the top of the curve. Maybe even just truncating the counts, or carrying the total for a column as yet another element in the tensor.
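To spell out what I mean, here are the sorts of transforms I'm speculating about -- truncation, log compression, and carrying the column total as an extra feature. These are my own suggestions, not anything tried in the paper:

```python
import numpy as np

def transform_counts(counts, cap=30):
    """Speculative alternatives to raw or sum-to-1 normalized counts:
    truncate at a cap, or compress with log1p so the top of the curve
    flattens while small counts remain distinguishable. The raw total
    could also ride along as an extra tensor element."""
    counts = np.asarray(counts, dtype=np.float32)
    truncated = np.minimum(counts, cap)   # clip the high end
    compressed = np.log1p(counts)         # soft compression of large counts
    total = counts.sum()                  # overall depth as its own feature
    return truncated, compressed, total
```

Either form preserves the depth information that straight normalization throws away, which is presumably why normalization hurt.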
Table 1 in the paper presents an excessively condensed comparison of 3,799 HiFi reads to a GIAB reference. Running the program took 1:24 on a single NVIDIA Tesla V100 with 16 GB of memory. It's also frustrating that the model used to calculate phred qualities for reads isn't specified.
In the figure below, the effect on insertions and deletions in homopolymers is shown. I'd like to see these figures replotted with a log Y axis, as I'm wondering if there are effects at the longer homopolymer lengths that are obscured by the fact that there just weren't very many examples at those lengths, so the dynamic range for visual comparison is far too compressed. But a lopping off of short deletions clearly happened -- roughly 25% of the 2- and 3-base deletions (curiously, no change for 1-base) and a more modest effect for 4 and 5. For insertions there is a sharp drop at 1, and visually the lines don't converge until about 7.
Another run used E. coli as the target genome and saw somewhat similar results, though ultimately E. coli just doesn't have homopolymers much above 10 bases long.
This paper throws a lot of data at its machine learning algorithm -- the subreads are divided into 100 basepair partitions, and each partition is "transformed into a tensor" that includes not only the sequences but also the two key polymerase kinetics parameters available in PacBio files -- the pulse width and the interpulse duration -- as well as the signal-to-noise ratio for each nucleotide. The manuscript says "outputs for each 100 bp partition in the full sequence are stitched together to produce the polished read", but it isn't really explained how this happens. It would appear the partitions are non-overlapping, so what happens to possible errors at the junctions? Some more explanation would be very useful here.
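My junction worry is easiest to see in code. A naive reading of "stitched together" with non-overlapping windows is plain concatenation -- a sketch of my interpretation, not the paper's actual implementation:

```python
def partition(read, width=100):
    """Split a draft read into non-overlapping 100 bp windows, as the
    manuscript describes. The last window may be shorter."""
    return [read[i:i + width] for i in range(0, len(read), width)]

def stitch(polished_windows):
    """Naive stitching by concatenation. With zero overlap, a base at a
    window boundary is polished with context from only one side -- the
    junction-error question raised above."""
    return "".join(polished_windows)
```

Contrast this with the Lai/NVIDIA scheme, where the 200-column overlaps give every position full flanking context in at least one chunk.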
I'm sadly lost much of the time trying to understand the details of the model architecture, but one interesting bit is that they cap the number of subreads in use to 20. Their results seem to justify this cap, but how are they selected? Ten from each strand? Random draw?
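Since the paper doesn't say how the 20 subreads are chosen, here is one hypothetical selection scheme of the kind I'm wondering about -- a strand-balanced random draw. Everything here is my guess, not DeepConsensus's actual behavior:

```python
import random

def select_subreads(subreads, cap=20, seed=0):
    """Hypothetical cap-at-20 selection: draw up to half the cap from
    each strand, then top up from whichever strand has leftovers.
    Each subread is a dict with a "strand" key of "+" or "-"."""
    rng = random.Random(seed)
    fwd = [s for s in subreads if s["strand"] == "+"]
    rev = [s for s in subreads if s["strand"] == "-"]
    rng.shuffle(fwd)
    rng.shuffle(rev)
    per_strand = cap // 2
    picked = fwd[:per_strand] + rev[:per_strand]
    leftovers = fwd[per_strand:] + rev[per_strand:]
    picked += leftovers[:cap - len(picked)]
    return picked
```

Whether selection is balanced, random, or ordered by quality could plausibly matter for strand-specific error modes, which is why I'd like the paper to say.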
When run on actual human data, DeepConsensus (as they call the program) generally does much better than the original results from the pbccs program. But not always -- it would be very interesting to look for patterns in the reads on which it does worse, particularly that little column on the right of the left figure where pbccs came out perfect and DeepConsensus didn't. But certainly the dominant pattern is improvement.
A side note on this: in a pre-print on a HiFi-oriented assembler (La Jolla Assembler) by Pavel Pevzner and colleagues the authors comment that dinucleotide repeats are another class of problem sequences for HiFi reads, which rings true to me. It's another class of error worth pulling out in these sorts of analyses.
The authors then go on to show improvement in HiFi assembly contiguity using the better-polished reads; quality matters! It takes some staring at the figure below to find the corresponding lines, but in this case, using the human data from two SMRT cells, there is a very large gain in contiguity; another figure shows the result for three SMRT cells, where the impact is noticeable but not as dramatic.
There's also discussion and supplementary tables on improvements in quality -- about a 2-phred-unit improvement in agreement with kmers but, more importantly, 43% fewer total (false positive + false negative) errors when variant calling. There are some figures breaking these down, and the biggest impact appears to be reducing false negatives for both SNPs and small indels, again much more impactful on the two-SMRT-cell dataset than the three-SMRT-cell dataset. Annoyingly, these plots use different Y axes, so you can't visually compare them -- if you look at the numbers, three SMRT cells with pbccs is still better than two SMRT cells with DeepConsensus.
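For a sense of scale on that phred figure: under the phred definition Q = -10·log10(p), a 2-unit gain by itself corresponds to cutting the error rate by roughly 37%, which is in the same ballpark as the 43% reduction in variant-calling errors (though those are different metrics). A quick check:

```python
def phred_to_error(q):
    """Phred quality to error probability: Q = -10 * log10(p),
    so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# A 2-phred-unit gain alone shrinks the error rate by ~37%:
reduction = 1 - phred_to_error(2)
```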
The last big experiment in the paper is one I keep oscillating on how interesting it is. At one level the overall conclusion is that longer HiFi reads yield better assemblies. Did you ever hear that water is wet? And there is a small but measurable increase in variant calling -- but that's in part because there is 16% more data in the 24 kb library than the 15 kb library. Strangely, this section doesn't compare pbccs to DeepConsensus for these libraries; presumably they would show a huge improvement in the assembly of the 24 kb library with DeepConsensus because many pbccs reads wouldn't make the quality cutoff upstream of the assembler.
The full DeepConsensus kit is available on GitHub. I've successfully run it on their supplied BAM files, but when I tried to subset some of my own data, the code bombed out with a most cryptic Python error message. Unfortunately further play with this keeps getting shoved aside for more pressing topics, but someday I'll get back to it.
Both of these papers show that computational pipeline improvements could extract additional value out of PacBio data, enabling better assemblies and variant calls. A really great follow-up to the DeepConsensus paper would be something looking at how far one can push HiFi insert sizes with this code -- could we see 40 kb HiFi? 50 kb HiFi? Of course it isn't easy making libraries that large, but since PacBio data yield is strongly influenced by insert length, this would be very valuable territory. Plus I have some projects that could really use some 60-70 kb HiFi reads.
These are both research papers, and my trouble with DeepConsensus is an indicator of this. For the PacBio community to be able to systematically leverage this work will require making the code robust. There's also the question of compute power and resources, and a very poor software design choice on the PacBio Sequel IIe. The only difference between the IIe and its predecessor is that the IIe has onboard compute power for HiFi generation -- but somebody decided it was a good idea to force an exclusive choice: use onboard HiFi generation or export the subreads, but not both. This is one of many situations where that design choice (which, since it is in software, could be fixed very quickly!) is extremely unfortunate -- one must forgo the compute power of the IIe if one wants subreads for DeepConsensus, or for looking for heteroduplexes in amplicon libraries, and so forth. PacBio has a big virtual user group meeting coming up soon (need to check if I actually registered!).
It's interesting to see that DeepConsensus uses some additional information -- the pulse width, interpulse duration, and signal-to-noise values -- which the other approach didn't use and which I believe are not used by pbccs. It certainly makes sense that this information might particularly help in resolving homopolymers. But the Lai/NVIDIA paper also saw significant boosts in accuracy without it, so how important is it? How do these methods compare on the same sequences, and are there differences in the error spectrum that might point to even better encodings of the data for training?