Wednesday, July 16, 2008

Forging into the gap

Gaps are important. There is a major brand by that name. Controversy over a perceived "missle gap" was a major issue in the Nixon-Kennedy election of 1960. Budget gaps cause governments to trim services. About a half an hour's drive west of where I grew up is the town of Gap, and a bunch of generations ago my ancestors probably passed through the Cumberland Gap.

Gaps occupy a special place in computational biology, specifically in the alignment of sequences and structures. As sequences evolve, they can acquire new residues (insertions) or lose residues (deletions), and so if we wish to align a pair of sequences we must put a gap in. Pairwise algorithms such as Needleman-Wunsch-Sellers and Smith-Waterman insert the optimal gaps -- given certain assumptions which include, but are not limited to, the match, mismatch, gap insertion and gap deletion penalties. Some pairwise alignment problems have been addressed by even more complicated gapping schemes. For example, if I am aligning a cDNA to a genomic sequence I may wish to have separate consideration of introns (a special case of gaps), gaps that would insert or remove multiples of three (codons) or gaps which don't all in either of those categories.

Multiple sequence alignment gets even harder. There are no exact algorithms to compute a guaranteed best alignment, so all methods have some degree of heuristics to them. Many algorithms are progressive, first aligning two sequences and then aligning another to that alignment and then another and so on, or perhaps aligning pairs of sequences and then aligning the aligned pairs and so on. Placement of gaps becomes especially tricky, as their placement in early alignments greatly influences the placement in later alignments, which could well be a bad thing.

Protein alignments in particular have the problem of trying to serve three masters, who are often but not always in agreement. An alignment can be a hypothesis of which parts of a protein serve the same role, a hypothesis as to which amino acids occupy similar positions in space, or a hypothesis as to which amino acids derive from codons with a shared ancestry. Particularly in the strongly conserved core of proteins these three are likely to be in agreement, but in the hinterlands of structural loops in proteins or disordered regions it's not so clear. There is also a bit of aesthetics that comes in; alignments just look neater and simpler when there are fewer gaps. Perhaps not quite Occam's Razor in action, but simplicity is appealing.

The June 20th issue of Science (yep, Science & Nature have been piling up) has a paper that addresses this issue and builds an algorithm unapologetically aligned to just the one goal: find the most plausible evolutionary history. They point out that while insertions and deletions are treated symmetrically by pairwise programs, they are quite asymmetric for progressive multiple alignment. The alignment gets to pay once for deleting something, but insertions (like overdue credit cards) incur a penalty with each successive alignment. It seems unlikely that nature works the same way, so this is undesirable.

One solution to this has been to have site-specific insertion penalties. Loytnoja & Goldman point out that this compensation often doesn't work and causes insertions to be aligned which are not homologous, in the sense that they each arose from a different event (indeed, these insertions should not be aligned with anything from an evolutionary point-of-view, though structurally or functionally an alignment is reasonable).

As an alternative, their method flags insertions made in early alignments so that they are treated specially in later alignments. The flagging scheme even allows insertions at the same position to be treated as independent -- they neither help nor penalize the alignment and are reported as separate entities.

Using synthetic data they tested their program against a number of other popular multiple aligners and found (surprise!) it did a better job of created the correct alignment. They also simulated what getting additional, intermediate data does for the alignments -- and scarily for the older alignment programs gap placement got worse (less reflective of the actual insertion/deletion history of the synthetic data).

The article closes with an interesting question: has our view of sequence evolution been shaped by incorrect algorithms? Is the dominant driver of sequence change in protein loops point mutants or small insertions/deletions.

Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis
Ari Löytynoja and Nick Goldman
p. 1632

No comments: