Monday, January 30, 2012

Does Illlmina Also Have A Homopolymer Problem?

One of the most widely-publicized error modes with Ion Torrent and 454 sequencing has been the challenge of correctly counting the number of bases in homopolymer runs.  Because these chemistries use non-terminating nucleotides, polymerase is free to add as many as possible.  Unfortunately, the signal linearity breaks down, making it difficult to correctly count.  Ion Torrent today released a note on homopolymers, but rather than plowing this well-trod ground it goes for a less publicized problem: Illumina having a more specific challenge in this department.  The note is available on the Ion Community, free registration required.

In particular, they are focused on MiSeq, but you can bet these sorts of comparisons will be extended to HiSeq as the Proton rolls closer to the finish line.  As pointed out by the application note, issues with base-calling near homopolymers have been cited before.  The specific issue is that while Ion and 454 have difficulty counting the length of the run, leading to indel errors, Illumina seems to have difficulty getting the base after the run correct, often trading the correct call for one more call of the homopolymer, as illustrated by my error in the title.  And it really is the base after the homopolymer; the effect is directional.  It is also substantial, accounting for about half of the base substitution errors in an E.coli dataset.  The error mode was also observed frequently in a human amplicon dataset.

So, MiSeq (and presumably HiSeq and GAIIx) appears to have a systematic homopolymer-related problem of its own.  How quickly will software developers respond?  The simplest "fix" is to disregard calls just after homopolymers, but that's not very satisfying.  Given the strand-specific nature of the error, it should be possible to adjust base callers and such to down-weight the final base in a called homopolymer.  If this can be done effectively, then a lot of gain could be had: eliminating half the miscalls in a large dataset.  In the simplest case, FASTQ files could be pre-processed to down-weight the quality score of the last base in a homopolymer.  Calibrating such adjustments could be tedious to execute, but would likely yield better variant identification.
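
As a rough illustration of the FASTQ pre-processing idea, here is a minimal Python sketch.  Everything specific in it is an assumption made for illustration rather than anything taken from the application note: the function names, the minimum run length of 3, the fixed 10-point penalty, the floor of Q2, and the Phred+33 encoding.  It lowers the quality of the last base of every called homopolymer run, working in read orientation since the effect is directional.  A real version would need the penalty calibrated against known errors, and note that it also penalizes runs that were called correctly.

```python
def downweight_homopolymer_ends(seq, qual, min_run=3, penalty=10, floor=2, offset=33):
    """Return a quality string with the last base of each long run lowered.

    All defaults are illustrative guesses, not calibrated values.
    """
    quals = [ord(c) - offset for c in qual]  # decode Phred+33 (assumed encoding)
    i, n = 0, len(seq)
    while i < n:
        j = i
        # Extend j to the final position of the run that starts at i.
        while j + 1 < n and seq[j + 1] == seq[i]:
            j += 1
        if j - i + 1 >= min_run:
            # The last called base of the run is the suspect one.
            quals[j] = max(floor, quals[j] - penalty)
        i = j + 1
    return "".join(chr(q + offset) for q in quals)


def reweight_fastq(lines, **kwargs):
    """Yield FASTQ lines with adjusted qualities (simple 4-line records assumed)."""
    lines = [ln.rstrip("\n") for ln in lines]
    for k in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[k:k + 4]
        yield header
        yield seq
        yield plus
        yield downweight_homopolymer_ends(seq, qual, **kwargs)
```

For example, `downweight_homopolymer_ends("ACGAAAAT", "IIIIIIII")` drops only the fourth A of the run from Q40 to Q30 and leaves every other base untouched.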

As an aside, that touches on a question I've wondered a bit about.  In a flow-based system such as 454 or Ion, a single signal is decomposed into a run of bases.  When doing so, how should the quality scores be set?  In a sense, the Phred-style quality scores aren't really intended for this; the quality score is an estimate of the probability that a base was called correctly, not whether the base should be called at all.  Most software worrying about this probably looks at the more-native SFF format, but one could also imagine an attempt to annotate homopolymers with a probability distribution for the length of the run.
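
To make the run-length distribution idea concrete, here is a toy Python sketch.  The noise model is purely an assumption for illustration: the normalized flow signal for a run of n bases is taken to be Gaussian around n, with a standard deviation that grows linearly with n, and every length from 0 to `max_len` is treated as equally likely beforehand.  The function name and all parameter values are hypothetical, and this is not how any vendor's base caller is known to work.

```python
import math


def run_length_posterior(signal, max_len=12, base_sd=0.05, sd_per_base=0.03):
    """Return P(run length = n | signal) for n = 0..max_len under a toy model."""
    weights = []
    for n in range(max_len + 1):
        sd = base_sd + sd_per_base * n  # assumed: noise grows with run length
        z = (signal - n) / sd
        weights.append(math.exp(-0.5 * z * z) / sd)  # Gaussian density up to a constant
    total = sum(weights)
    if total == 0.0:
        raise ValueError("signal is too far from every candidate length")
    return [w / total for w in weights]


def phred_of_best_length(posterior):
    """Phred-style score for the most probable length being the right one."""
    p_wrong = max(1.0 - max(posterior), 1e-10)  # cap to avoid log of zero
    return -10.0 * math.log10(p_wrong)
```

Under these made-up parameters a signal of 3.0 puts essentially all of the probability on a run of three, while a signal of 5.5 splits it between five and six, which is exactly the kind of ambiguity a single per-base quality score cannot express.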

Undoubtedly other systematic errors exist for both platforms; troubles for Illumina with specific G+C-rich motifs have been reported previously.  Competing manufacturers would appear to have an incentive to pick at the flaws of their opponents, but if they are smart they'll spend just as much time focused on their own platform.  Of course, independent researchers will be the most trusted.  Such comparisons can yield interesting fruit; a recent comparison of Illumina vs. Complete Genomics noted that on a human sample each platform yielded unique variants which could be validated.  Running samples on multiple technologies is likely to be cost-prohibitive on a routine basis, but for some critical samples or confusing results that might be the way to go.

One final thought: this analysis was enabled by the existence of public datasets of E.coli strains run on each platform.  Those are useful comparators, but E.coli certainly doesn't provide as much breadth in a performance benchmark as would be really desirable.  Obvious other choices would be an AT-rich bacterial genome and a GC-rich bacterial genome, but even these would lack potentially troublesome features (such as simple nucleotide repeats with repeat units longer than one).  As the capacity of the benchtop sequencers grows, perhaps small eukaryotic genomes could be used.  Alternatively, perhaps some enterprising benchmarkers could create a pool of BACs which together are E.coli-ish in size, but which span a wider range of challenging DNA contexts.


Anonymous said...

This being an election year, it seems quite appropriate to call out the other "guy" when you have a challenge to address. Also, software is never a best way to deal with a data issue.

Anonymous said...

seriously, NGS is clearly ready for the masses... NOOOTT !

ECO said...

Great post as always Keith. I would _love_ to get a set of synthetic difficult templates designed...back and forth questions like this between two companies are much more easily settled with neutral data.

Paul T Morrison said...

I second that great idea of creating an ecolish sized pool of BACs with problematic sequences embedded. I might even mention your name when I propose this as something the NGS group in ABRF should do.

Anonymous said...

May be irrelevant at this point, but SOLiD performed better at homopolymers and indels. Would be nice to make the parallels comparison with the E. coli data sets released for this platform. It seems LT forgot to do this (or they don't care anymore).

Anonymous said...

I always wondered why there is so few high-quality cross-platform comparison of complex sequences. Best you usually get is how many variants Platform A found, how many Platform B and how many were shared. But how close to reality both platforms are is often not told.....In this context, the ArchonX-Prize will run on partially pre-sequenced human genomes, so there will be high-quality reference sequences to compare the results to.

As to the "post-homopolymer" issue in MySeq: If you have enough coverage to sequence the critical homopolymers in both directions, why not just disregard the next base AFTER a homopolymer altogether, as long as there are enough reads of that base in the other direction (where it is the last base before the homopolymer and should not be affected)?


Anonymous said...

GC-related problems with 454 have been shown before, see e.g.

I'd assume Ion Torrent to have similar issues, as well..

Søren Mønsted said...

I would guess most variant callers require presence in reads from both strands, in which case the problem is not as pressing as Ion Torrent's homopolymer problems (which cannot be handled this way)

Unknown said...

Do you guys have a paper about Illumina homopolymer run fro Ion torrent that you mentioned?