In the post, I mentioned two tools I have used for upstream processing of Illumina data prior to assembly: FLASH and MUSKET. FLASH compares each pair of reads from the same fragment and looks for an overlap; if it finds one, it merges the two reads, recomputing the sequence qualities in the overlap (since you now have more information, the uncertainty around the base call goes down, so the quality value goes up). MUSKET analyzes k-mer frequencies in the input data and uses them to trim and correct reads; if a k-mer is rare, it is likely a sequencing error.
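To make the quality recomputation concrete, here is a rough sketch of the idea in Python. It is illustrative only, under my own assumptions about the scoring: FLASH's actual quality model differs in detail, but the shape is the same, since agreement between the two reads raises confidence while disagreement keeps the higher-quality call at reduced confidence.

```python
def merge_base(b1, q1, b2, q2, cap=41):
    """Merge one overlapping base pair from a read and its mate.

    Toy FLASH-style rule (my own simplification, not FLASH's exact model):
    agreement sums the Phred qualities (capped); disagreement keeps the
    higher-quality call with quality equal to the difference.
    """
    if b1 == b2:
        # Two independent observations of the same base: combined
        # Phred quality is roughly the sum, capped at a maximum.
        return b1, min(q1 + q2, cap)
    # Mismatch: keep the higher-quality call; the new quality is the
    # difference (equal qualities -> quality 0, i.e. total uncertainty).
    if q1 >= q2:
        return b1, q1 - q2
    return b2, q2 - q1

print(merge_base("A", 30, "A", 20))  # agreement: quality rises
print(merge_base("A", 30, "C", 20))  # mismatch: 'A' wins, quality drops
```

Note the degenerate case: an equal-quality mismatch gives quality 0, which is exactly the tie-break problem discussed below.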
Now, my order was to first use MUSKET to correct and then FLASH to merge reads. Originally this was an arbitrary decision, but I have to some degree retro-justified it as "if MUSKET corrects errors first, then FLASH should find more correct overlaps". A related issue brought up in a paper on a related tool called COPE (more later) is that if FLASH finds a mismatch in the overlap region and both calls have the same quality value, it must pick one at random as the new base (presumably giving it a terrible quality value, as our uncertainty has just shot up!).
What Michael Schatz (via Twitter) and an anonymous poster here proposed is running FLASH first and then MUSKET. The more I think about it, the more my logic seems to justify that order nearly as well: if FLASH generates an error, MUSKET will probably correct it.
COPE, by the way, claims to improve on FLASH by using k-mer information to guide the overlapping of the reads; given a choice of possible consensus sequences, it favors those that avoid rare k-mers. In their paper, they report better performance than FLASH on test data, but they didn't correct with MUSKET, did they? So if the data is pre-corrected with MUSKET, or post-corrected with MUSKET, how different are the results?
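The COPE-style idea of using k-mer rarity to break ties can be sketched in a few lines. Everything here is a toy of my own construction (the function names, the scoring rule, and the rarity threshold are all assumptions, not COPE's actual algorithm): given two candidate bases at a mismatch position, prefer the one whose surrounding k-mers look least rare in a table built from trusted data.

```python
from collections import Counter

def kmers(seq, k):
    """All k-length substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def resolve_mismatch(context, pos, candidates, kmer_counts, k=5, min_count=2):
    """Pick the candidate base whose resulting k-mers look least rare.

    Toy COPE-style tie-break (my own simplification): count how many
    k-mers spanning the mismatch position fall below a rarity threshold,
    and prefer the candidate introducing the fewest rare k-mers.
    """
    def rare(base):
        seq = context[:pos] + base + context[pos + 1:]
        spanning = [seq[i:i + k]
                    for i in range(max(0, pos - k + 1),
                                   min(pos + 1, len(seq) - k + 1))]
        return sum(1 for km in spanning if kmer_counts[km] < min_count)
    return min(candidates, key=rare)

# Build a toy k-mer table from "trusted" reads, then resolve a mismatch
# where the two overlapping reads disagree (here: 'A' vs 'T' at position 4).
trusted = ["ACGTACGTACGT"] * 3
counts = Counter(km for read in trusted for km in kmers(read, 5))
print(resolve_mismatch("ACGTNCGTACGT", 4, "AT", counts))
```

This prints `A`, since choosing `A` reproduces k-mers seen in the trusted reads while choosing `T` introduces five never-seen k-mers.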
These examples illustrate a pervasive challenge in sequence processing (though hardly unique to it): there are a multitude of tools and a multitude of orders in which they might be applied (and, of course, many have free parameters to boot). That makes for a very large combinatorial space of possible pipelines to explore. Plus, some assemblers already have versions of these tools built in whereas others do not, so a pre-processing pipeline may have different impacts on different downstream assemblers. A very stark example of this: my first (and so far only) attempt to use the digital normalization feature of khmer yielded a horrible Ray assembly; apparently I had reduced coverage far too much.
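Just to show how fast that combinatorial space blows up, here is a tiny enumeration. The tool and assembler names are stand-ins for the ones above, and this ignores free parameters entirely:

```python
from itertools import permutations

# Enumerate every pipeline: each subset of pre-processing tools, in
# every order, feeding each assembler. (Names are just placeholders.)
tools = ["musket", "flash", "diginorm"]
assemblers = ["ray", "velvet"]

pipelines = [
    (order, asm)
    for r in range(len(tools) + 1)        # allow skipping tools too
    for order in permutations(tools, r)
    for asm in assemblers
]
print(len(pipelines))  # (1 + 3 + 6 + 6) orderings x 2 assemblers = 32
```

Thirty-two pipelines from just three tools and two assemblers, before a single parameter sweep. Add one more tool and the ordering count nearly quadruples.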
Alas, the only way to decide is to test empirically. Indeed, I thought about waiting to write this until I had explored a number of these options on datasets where we are confident that, for at least some of the assembly, we know the "truth". But, given the challenge of completing such a study promptly, I forged ahead. With luck, sometime early in the new year I can follow up with actual results.
To reiterate the call from yesterday: if exploring a huge combinatorial space of sequence processing programs appeals to you and you'd like to work at an exciting biotech, drop me a line or send your CV to email@example.com. We need you!