One of the most widely publicized error modes of Ion Torrent and 454 sequencing has been the difficulty of correctly counting the number of bases in homopolymer runs. Because these chemistries use non-terminating nucleotides, the polymerase is free to add as many bases as the template allows in a single flow. Unfortunately, the signal does not stay linear as runs get longer, making long runs difficult to count correctly. Ion Torrent today released a note on homopolymers, but rather than plowing this well-trod ground it goes for a less publicized problem: Illumina has a more specific challenge of its own in this department. The note is available on the Ion Community, free registration required.
In particular, the note focuses on MiSeq, but you can bet these sorts of comparisons will be extended to HiSeq as the Proton rolls closer to the finish line. As the application note points out, issues with base-calling near homopolymers have been cited before. The specific issue is that while Ion and 454 have difficulty counting the length of the run, leading to indel errors, Illumina seems to have difficulty getting the base after the run correct, often trading the correct call for one more call of the homopolymer base, as illustrated by my error in the title. And it really is the base after the homopolymer; the effect is directional. It is also substantial, accounting for about half of the base substitution errors in an E. coli dataset. The error mode was also observed frequently in a human amplicon dataset.
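To make the error mode concrete, here is a minimal Python sketch of how one might tally it. It is not the method used in the application note. It assumes a read and its reference have already been aligned without gaps, are the same length, and are both written in the read's sequencing direction; the function name and the `min_run` threshold are hypothetical choices of mine.

```python
def classify_substitutions(ref, read, min_run=3):
    """Count mismatches, and how many of them extend a preceding homopolymer.

    ref and read are gap-free, equal-length strings in read orientation.
    A mismatch counts as an "extension" when the reference run ending just
    before it is at least min_run long and the read repeats that run's base.
    Returns (total_mismatches, extension_mismatches).
    """
    if len(ref) != len(read):
        raise ValueError("ref and read must be the same length")
    total = 0
    extension = 0
    run_len = 0  # length of the reference homopolymer ending at position i-1
    for i, (r, q) in enumerate(zip(ref, read)):
        if r != q:
            total += 1
            if i > 0 and run_len >= min_run and q == ref[i - 1]:
                extension += 1
        # Update the run length so that it now ends at position i.
        if i > 0 and r == ref[i - 1]:
            run_len += 1
        else:
            run_len = 1
    return total, extension


# Position 8 (G read as A after AAAA) is an extension; position 11 is not.
print(classify_substitutions("ACGTAAAAGCTT", "ACGTAAAAACTA"))
```

Because the effect is directional, reverse-strand reads would need to be compared against the reverse complement of the reference before counting.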
So, MiSeq (and presumably HiSeq and GAIIx) appears to have a systematic homopolymer-related problem of its own. How quickly will software developers respond? The simplest "fix" is to disregard calls just after homopolymers, but that's not very satisfying. Given the directional, strand-specific nature of the error, it should be possible to adjust base callers and downstream tools to down-weight the final base of an apparent homopolymer in a read, since that base may really be a miscall of whatever follows a shorter run. If this can be done effectively, a lot could be gained: up to half the miscalls in a large dataset eliminated. In the simplest case, FASTQ files could be pre-processed to lower the quality score of the last base in each homopolymer. Calibrating such adjustments could be tedious, but would likely yield better variant identification.
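A rough sketch of that FASTQ pre-processing idea in Python follows. Everything tunable here is an assumption of mine rather than a calibrated value: the Phred+33 encoding, the minimum run length of 4, the 10-point penalty and the floor of Q2. The function names are hypothetical too.

```python
import re

PHRED_OFFSET = 33  # assumes Sanger / Illumina 1.8+ (Phred+33) quality encoding


def downweight_homopolymer_ends(seq, qual, min_run=4, penalty=10, floor=2):
    """Return qual with the last base of each run >= min_run bases penalized."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(c) - PHRED_OFFSET for c in qual]
    # ([ACGT])\1{n,} matches a base followed by at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    for match in pattern.finditer(seq.upper()):
        last = match.end() - 1  # final base of the run, in read direction
        scores[last] = max(floor, scores[last] - penalty)
    return "".join(chr(s + PHRED_OFFSET) for s in scores)


def rewrite_fastq(lines, **kwargs):
    """Yield adjusted FASTQ lines; assumes strict 4-line records.

    A trailing partial record is silently dropped by zip().
    """
    it = (line.rstrip("\n") for line in lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        yield header
        yield seq
        yield plus
        yield downweight_homopolymer_ends(seq, qual, **kwargs)


record = ["@read1\n", "ACGTAAAAAGC\n", "+\n", "IIIIIIIIIII\n"]
print("\n".join(rewrite_fastq(record)))
```

In the example record the fifth A of the run drops from Q40 (`I`) to Q30 (`?`) while every other base is left alone. A fixed penalty is the crudest possible choice; a calibrated version would presumably depend on run length, base identity and cycle number.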
As an aside, that touches on a question I've wondered a bit about. In a flow-based system such as 454 or Ion, a single signal is decomposed into a run of bases. When doing so, how should the quality scores be set? In a sense, Phred-style quality scores aren't really intended for this; the quality score is an estimate of the probability that a base was called correctly, not of whether the base should have been called at all. Most software worrying about this probably looks at the more-native SFF format, but one could also imagine an attempt to annotate homopolymers with a probability distribution for the length of the run.
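To illustrate what such an annotation might look like, here is a toy Python sketch that turns one normalized flow signal into a probability distribution over run lengths. The noise model (Gaussian, with a standard deviation that grows linearly with run length, and a flat prior) and its parameters are purely my assumptions for illustration; they are not taken from any vendor's base caller or from the SFF specification. The only standard piece is the Phred scale itself, Q = -10 log10(P(error)).

```python
import math


def run_length_posterior(signal, max_len=10, base_sd=0.15, sd_per_base=0.05):
    """Posterior over run lengths 0..max_len for one normalized flow signal.

    Hypothetical model: signal ~ Normal(n, base_sd + sd_per_base * n) with a
    flat prior over n. The defaults are illustrative, not calibrated.
    """
    weights = []
    for n in range(max_len + 1):
        sd = base_sd + sd_per_base * n
        z = (signal - n) / sd
        weights.append(math.exp(-0.5 * z * z) / sd)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("signal is too far outside the modelled range")
    return [w / total for w in weights]


def phred(p_error, cap=60):
    """Phred scale: Q = -10 * log10(P(error)), capped to avoid infinity."""
    if p_error <= 0:
        return cap
    return min(cap, int(round(-10 * math.log10(p_error))))


posterior = run_length_posterior(3.1)
best = max(range(len(posterior)), key=posterior.__getitem__)
# One "run-level" quality: how confident we are in the called length.
print(best, phred(1.0 - posterior[best]))
```

A single Phred value per run throws away the shape of the distribution, which is exactly the information a variant caller would want when deciding between, say, a 3-mer and a 4-mer.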
Undoubtedly other systematic errors exist for both platforms; troubles for Illumina with specific G+C-rich motifs have been reported previously. Competing manufacturers would appear to have an incentive to pick at the flaws of their opponents, but if they are smart they'll spend just as much time focused on their own platform. Of course, independent researchers will be the most trusted. Such comparisons can bear interesting fruit; a recent comparison of Illumina vs. Complete Genomics noted that on a human sample each platform yielded unique variants which could be validated. Running samples on multiple technologies is likely to be cost-prohibitive on a routine basis, but for some critical samples or confusing results that might be the way to go.
One final thought: this analysis was enabled by the existence of public datasets of E. coli strains run on each platform. Those are useful comparators, but E. coli certainly doesn't provide as much breadth in a performance benchmark as would be desirable. Obvious other choices would be an AT-rich bacterial genome and a GC-rich bacterial genome, but even these would lack potentially troublesome features (such as simple nucleotide repeats with repeat units longer than one). As the capacity of the benchtop sequencers grows, perhaps small eukaryotic genomes could be used. Alternatively, perhaps some enterprising benchmarkers could create a pool of BACs which together are E. coli-ish in size, but which span a wider range of challenging DNA contexts.