So, a significant advance, but a bit unpleasant one if you are in the space. You now have several ugly options before you with regard to your prior data mapped to an earlier reference.
The do nothing option must appeal to some. Forgo the advantages of the new reference and just stick to the old. Perhaps start new projects on the new one, leading to a cacophony of internal tools dealing with different versions, with an ongoing risk of mismatched results. Also, cross your fingers that none of changes might be revised if analyzed against the new reference. Perhaps this route will be rationalized as healthy procrastination until a well-vetted set of graph-aware mappers exist, but once you start putting-off it is hard to stop doing so.
The other pole would be to embrace the new reference whole-heartedly and realign all the old data against the new reference. After burning a lot of compute cycles and storage space running in place, spend a lot of time reconciling old and new results. Then decide whether to ditch all your old alignments, or suffer an even larger storage burden.
A tempting shortcut would be to just remap alignments and variants by the known relationships between the two references. After all, the vast majority of the results will simply shift coordinates a bit, but with no other effects. In theory, one could estimate all the map regions that are now suspect and simply realign the reads which map to those regions, plus attempt to place reads that previously failed to map. Again reconciliation of results, but on a much reduced scale.
None would seem particularly appealing options. Perhaps that latter route will be a growth industry of new tools acting on BAM, CRAM or VCF which themselves will provide a morass of competing claims of accuracy, efficiency and speed. Doesn't make me at all in a hurry to leave a cozy world of haploid genomes that are often finished by a simple pipeline!
2 comments:
I agree with the short(-ish) cut alternative:
take all reads mapping to the questionable regions of older build AND all unmapped reads -> align them to the whole genome and call variants (perhaps use a 2-step approach for speed, like we did in our recent Plos paper)
- liftOver (or similar) for the already called variants on all other regions
The slides for her presentation are now online
Post a Comment