Quality scores are a useful adjunct to sequencing data and are commonly expressed as phred scores, which are the integer part of -10*log10 of the error probability. Any base caller needs to estimate these and many downstream programs, from aligners to assemblers to variant callers, rely on these quality values for their operations. In many cases, the individual quality scores are combined to generate some joint estimate of the error (I built one such model at Codon). These error probabilities come not from an infallible source, but are rather estimated from aspects of the raw data.
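For the curious, the conversion in both directions is just the formula above; here is a minimal Python sketch of it (the example values are purely illustrative):

```python
import math

def phred(p_error):
    """Phred quality from an error probability: integer part of -10*log10(p)."""
    return int(-10 * math.log10(p_error))

def error_prob(q):
    """Error probability implied by a phred quality score."""
    return 10 ** (-q / 10.0)

print(phred(0.001))    # a 1-in-1000 error chance is roughly Q30
print(error_prob(20))  # a Q20 base carries about a 1% chance of being wrong
```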
Now, if we have a sequencing dataset for which we know truth, we can compare the actual error rates with those predicted by the quality scores. Ion Torrent uses sequencing of E. coli DH10B as such a dataset. DH10B is useful for this because (assuming perfection in the aligner) it should have no polymorphisms; deviations from the reference should all be errors. I've had access to a few such datasets for over a month, but from a third party who made me swear not to share any details about them. Luckily, Ion now has on their site what is apparently a recent run on the 314 chip with well over 500K reads. In addition to the raw data, they provide alignments generated with their TMAP program.
I'm going to spew out a bunch of plots, but first I'll describe them. Each plot shows the difference between the observed quality score and the called quality score. If this delta is positive, it means the real quality is better than the caller predicted. I call this case "pessimistic prediction"; the base caller was undercalling the quality. On the other hand, if the value is negative then we have an "optimistic prediction"; the caller thinks it is doing better than it is. A perfect caller would always score 0. One final note: to compute the error frequencies a pseudocount of 1 was added to the error counts; otherwise bins with no observed errors blow up in the log calculation.
In the plots, the X-axis is sequence read position and the Y-axis is the quality delta value. Each plot is further trellised by the called quality value. Not only does this show the deltas, it also reveals a lot about the quality profile across a run. Only bins with at least 200 events are shown, and point sizes are proportional to the log2 of the number of events.
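To make the pseudocount and binning concrete, here is a minimal Python sketch of the delta calculation; the `events`/`errors` tables (keyed by read position and called quality) are assumed to have been tabulated from the alignments elsewhere, and the names are illustrative rather than the actual code behind these plots:

```python
import math

MIN_EVENTS = 200   # only plot bins with at least 200 events
PSEUDOCOUNT = 1    # added to error counts so empty bins don't blow up the log

def quality_deltas(events, errors):
    """events/errors: dicts keyed by (read_position, called_quality) -> count."""
    rows = []
    for key, n_events in events.items():
        if n_events < MIN_EVENTS:
            continue
        pos, called_q = key
        err_rate = (errors.get(key, 0) + PSEUDOCOUNT) / float(n_events)
        observed_q = -10 * math.log10(err_rate)
        # positive delta = pessimistic caller, negative = optimistic caller
        rows.append((pos, called_q, observed_q - called_q, n_events))
    return rows
```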
So first, let's look at this plot for TMAP. The first thing I noted is that most of the values are above 0; the caller is being too pessimistic, often extremely so, with delta values of 20 or more. So, for example, this means that some of the lowest quality calls (qs=3) near the beginning of the read are actually more like 15-25. The high end is less affected, but still shifted; the maximum quality of 33 really should be more like 36 in places.
A curious bit is the structure of some of these plots. I would have expected smoother curves, though perhaps that is naive. Some have discontinuities or multiple peaks. Presumably much of this structure are artifacts of the base caller's quality prediction algorithm.
Also note that very high quality values are seen only at certain parts of the run: the highest quality value is 33 and isn't seen after position 32 in a run.
Is Ion's caller ever overestimating quality? This is easier to see with the plot filtered in Spotfire so that delta values greater than -1 are removed (above; -1 is pretty much accounted for by the rounding error in generating integer quality scores). According to these plots it is rare and occurs mostly (and most intensely) at the beginning of the reads. That's a suspicious position, which we'll look at more below. The one exception to this pattern is quality score 30 at position 24, which is really more like 28. Not off by much, and perhaps this is just the sort of noise this fine-tooth comb will find.
Now, when I saw the same beginning-of-read effect in the earlier data, it was suggested that maybe this was an alignment issue related to homopolymer runs; a gap in the beginning of the sequence might not be detected, causing the beginning of the read to be misaligned. Indeed, I was using a non-gapping aligner (Bowtie) in that first analysis, and a little custom end-adjuster confirmed that inserting a few gaps could drastically improve things. What if we try a better aligner? TMAP is meant to be optimized for Ion Torrent data, but is it?
The aligner I use in this situation is BWA-SW. On this dataset, it aligns slightly fewer reads than TMAP (485707 vs. 508989), or about 5% fewer (there are also 5 reads aligned by BWA-SW but not TMAP). The plots are now remarkably better (BWA-SW in gold). The end effect goes away entirely, and the plots overall are in BWA-SW's favor; the gold dots are nearly always above the blue dots.
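If you want to repeat the aligned-read comparison, a quick pysam sketch along these lines will do it (the BAM filenames are placeholders, not the actual files from the Ion site):

```python
import pysam

def aligned_read_names(bam_path):
    """Return the set of read names with at least one aligned record."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return {rec.query_name for rec in bam.fetch(until_eof=True)
                if not rec.is_unmapped}

tmap_reads  = aligned_read_names("tmap.bam")    # placeholder filename
bwasw_reads = aligned_read_names("bwasw.bam")   # placeholder filename

print("aligned by TMAP only:  ", len(tmap_reads - bwasw_reads))
print("aligned by BWA-SW only:", len(bwasw_reads - tmap_reads))
print("aligned by both:       ", len(tmap_reads & bwasw_reads))
```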
It looks like it could be a major project to really scrutinize the differences between the TMAP and BWA-SW alignments. Spot checking a few, I am seeing some TMAP alignments that are curiously aggressive. For example, in one the initial base is matched and then the next 4 bases of the read are deleted, followed by 95 matching bases; BWA-SW calls for the first base to be skipped (unaligned) and then 95 bases of match. Another small observation: in the BWA-SW alignments which leave the first base unaligned as part of a homopolymer run, an extra T seems to be called more often than an extra G, and each more often than an extra A or C. I was guessing extra Gs would be most common, given that the last base of the key sequence is a G and there might be a problem counting after G. The pattern holds even for cases where BWA-SW thinks the first base should be skipped but it is not followed by an identical base: over 50% of the time it is a T which should be skipped over, and the pattern holds even if you remove homopolymers. Very strange; one guess is that this is an artifact of the library preparation protocol (which I am not familiar with in depth) and that the T really is there, but didn't come from DH10B.
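The "skip T" tally is easy to reproduce with pysam; here is a sketch of that spot check (the filename is a placeholder, and I restrict it to forward-strand reads to keep the orientation bookkeeping simple):

```python
from collections import Counter
import pysam

SOFT_CLIP = 4  # CIGAR operation code for a soft clip

skipped = Counter()
with pysam.AlignmentFile("bwasw.bam", "rb") as bam:   # placeholder filename
    for rec in bam.fetch(until_eof=True):
        if rec.is_unmapped or rec.is_reverse or not rec.cigartuples:
            continue
        op, length = rec.cigartuples[0]
        if op == SOFT_CLIP and length == 1:
            # tally the base BWA-SW left unaligned at the start of the read
            skipped[rec.query_sequence[0]] += 1

print(skipped)
```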
One last observation. This dataset had 260 flows (including the 4-base key sequence), and a handful of reads were as long as 112 bases. With BWA-SW, the observed quality scores at positions of 100 or greater, for those bins with at least 500 counts, were well above 20. This would argue that the run was really stopped well short of its maximum potential; additional flows would have yielded quality data (though I won't attempt to estimate how many would make sense).
These are some preliminary observations; it wouldn't shock me if I find some condition I haven't accounted for which contributes to this and which isn't interesting. Also, the impact of these observations on an experiment will depend on the experiment. For example, in a non-barcoded amplicon experiment the precise quality of the first 20 or so bases is nearly irrelevant, as these are part of the targeting sequences and therefore any variants seen are far more likely to be oligonucleotide synthesis errors than real polymorphisms. For a barcoded design, on the other hand, the barcode is likely in that beginning stretch and must be designed to account for these issues. Fusion primer designs would avoid the "skip T" issue described above, if that is an artifact of the ligation process. On the other hand, if that really is an artifact of the sequencing platform it will show up again. Applications such as de novo assembly, which have the least information to diagnose problems, should be the most cautious. In any case, a good rule is to make sure any variant you are calling is supported by a diversity of read positions; variants always called by the beginning or end of a read are suspect (I've seen this in another experiment).
10 comments:
Great post, thanks. I was looking at the Ion Torrent website for the data you mention, but couldn't find it. Do you have a link?
Interesting. Did you calculate what the raw substitution/indel error rate in the reads is?
Two documents are relevant here on the Torrent Dev site; registration may be required. It wouldn't shock me if non-owners have some extra hoops to jump through to get registered correctly; that's a gripe for another day (I'm not an owner, but already jumped through the hoop).
The actual dataset, as a gz-compressed tar file containing SFF, BAM & FASTQ (and a few other things).
Ion Personal Genome Machine™ Performance Overview describes performance enhancements, and the plots I believe are derived from this dataset. There is a raw error rate plot there (I think it includes indels). I need to analyze the indels in more depth.
I think it would be more interesting to look at observed quality and error rate than quality deltas. That would give a better estimation of the performance of the machine.
You can find the raw error rates in the above-referenced doc.
My interest in the deltas is more application-focused - if you can trust the quality scores, then you can make estimates of the level of belief in observed variants.
I can't because, as you say, it's registration only and I don't think I'd pass the "registration hoops".
If you're interested in the accuracy of the quality scores, I think you'd probably be better off looking at similar Q-predicted against Q-observed plots, then breaking that down by position/other features...
Thank you Keith for your insightful observations. In fact, one of the underlying algorithms in TMAP is the BWA-long algorithm (bwasw). TMAP has many useful settings, and the clipping end-effects you observe are a result of some of those settings, including the scoring parameters (match, mismatch, gap open, and gap extension) and the desired soft-clipping settings (any inclusion/exclusion of 3' and 5' soft-clipping). These settings can be modified to remove these effects and obtain equal or better results than those you show for BWA-long (bwasw). For more information and discussion, see http://ioncommunity.iontorrent.com/.
@new: I couldn't access the file yesterday, but I can today (one day after registering)...
What happened to your last post?
For those of us who would be interested in repeating the same sort of analysis on in-house data, any chance you'd be willing to share the scripts, etc. you used to generate the plots from the BAM file?