[Beg-evodevo] Comparative genomics: Lining up is hard to do


Yes, I already knew about that. I spoke with Nick Goldman when I was at EBI earlier this year. The article was at first sent back immediately, without review, but Goldman then appealed, and so it has now been published ... :-)
http://www.nature.com/nrg/journal/v9/n8/full/nrg2420.html
Fulltext:
Nature Reviews Genetics 9, 573 (August 2008) | doi:10.1038/nrg2420
Comparative genomics: Lining up is hard to do
Tanita Casci
Comparisons between proteins or between DNA stretches tell a story about the evolutionary past. Unfortunately, however, which story they tell often depends on which method is used to line up the sequences. A new method claims to improve the accuracy of current alignments by making use not only of the similarity between sequences but also their phylogenetic history.
Commonly used methods for aligning multiple sequences rely on sequential pairwise alignments between sequences — the most closely related sequences are aligned first, followed by progressively more distant ones, all the way down the evolutionary tree. The problem with these so-called 'progressive algorithms' is that it is impossible to distinguish insertions from deletions when using only two sequences, causing biases that distort subsequent alignments. Traditional implementations of this algorithm set a higher penalty for insertions than for deletions and deem the former less likely to have occurred. These biases are particularly problematic when there are multiple neighbouring insertions or deletions — for example, such insertions end up being aligned one under the other even though, by definition, they are not evolutionarily related. Misleading alignments caused by these false homologies are propagated, and lead to incorrect inferences about sequence evolution.
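To illustrate why two sequences alone cannot tell an insertion from a deletion, here is a minimal sketch of a global pairwise alignment (Needleman-Wunsch with a linear gap penalty). The scoring values and the example sequences are assumptions chosen for illustration; they are not taken from the article or from any particular alignment program. Swapping the two inputs gives the same score and the same columns, with the gap simply moving to the other row.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    The scores are illustrative assumptions, not values from any real aligner.
    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back one optimal path from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    long_seq, short_seq = "ACGTTGCA", "ACGGCA"  # hypothetical sequences
    for x, y in ((long_seq, short_seq), (short_seq, long_seq)):
        s, row_x, row_y = needleman_wunsch(x, y)
        print(s)
        print(row_x)
        print(row_y)
```

The alignment records only that one sequence is two residues longer than the other. Whether "TT" was inserted in one lineage or deleted in the other is not something the pairwise score can express.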
Two sequences do not contain information about whether a length difference between them is caused by an insertion or a deletion — however, in a multiple (as opposed to pairwise) alignment there are more than two sequences. The new phylogeny-aware algorithm uses outgroup information from related sequences to infer which of the two possible changes has happened. It does so by 'flagging' the gaps (length changes) made in earlier alignments so that they can be accounted for in future alignments without incurring a penalty. The re-use of a flagged gap also indicates that a particular length change was created by an insertion; the gap can then be labelled as a 'permanent' insertion and correctly kept distinct from any neighbouring ones.
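The outgroup reasoning described above can be sketched as a simple parsimony rule. This is only an illustration of the idea, not PRANK's actual algorithm: the function name `classify_indels`, the three pre-aligned rows and the labels are all hypothetical. For each column where the two ingroup sequences differ in length, the outgroup decides the call: if the outgroup also lacks the residue, the residue is treated as an insertion in the sequence that carries it; if the outgroup has it, the gap is treated as a deletion in the sequence that lacks it.

```python
GAP = "-"


def classify_indels(row_a, row_b, row_out):
    """Label length differences between two aligned ingroup rows.

    row_a, row_b: aligned ingroup sequences (equal length, '-' for gaps).
    row_out: an aligned outgroup sequence used to polarise each change.
    Returns a list of (column_index, label) tuples.

    Hypothetical helper for illustration; a single-outgroup parsimony rule,
    far simpler than what a real phylogeny-aware aligner does.
    """
    if not (len(row_a) == len(row_b) == len(row_out)):
        raise ValueError("all rows must have the same aligned length")

    labels = []
    for col, (x, y, o) in enumerate(zip(row_a, row_b, row_out)):
        if (x == GAP) == (y == GAP):
            continue  # no length difference between a and b in this column
        carrier, lacking = ("a", "b") if x != GAP else ("b", "a")
        if o == GAP:
            # Outgroup lacks the residue: it most likely arose in the carrier.
            labels.append((col, f"insertion in {carrier}"))
        else:
            # Outgroup has the residue: it was most likely lost in the other.
            labels.append((col, f"deletion in {lacking}"))
    return labels


if __name__ == "__main__":
    a, b = "ACGTTGCA", "ACG--GCA"  # hypothetical aligned ingroup pair
    print(classify_indels(a, b, "ACGTTGCA"))  # outgroup carries TT
    print(classify_indels(a, b, "ACG--GCA"))  # outgroup lacks TT
```

With the same ingroup pair, the call flips depending on the outgroup, which is the information a purely pairwise step does not have.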
The new algorithm, called PRANK <http://www.ebi.ac.uk/goldman-srv/prank/>, was tested on simulated sequence data. Traditional methods were shown to introduce biases, and therefore misalignments, even when sequences are closely related. PRANK performs better than traditional methods in this situation and shows no bias as the sequences get more diverged and become harder to align. It is striking that traditional methods are error prone even when sequences get increasingly similar — indicating that denser species sampling and more DNA sequencing cannot correct the algorithmic flaw in the alignment.
Correct alignment remains a challenge, but less so than in the past. Why have programs traditionally performed so badly? The authors suggest that alignments were historically developed to compare protein sequences, in which any biases would be likely to cluster harmlessly in unconstrained regions. Given the increasingly widespread analysis of genomic DNA for understanding evolution there is a pressing need to get these methods up to scratch.
Links
WEB SITE
* PRANK <http://www.ebi.ac.uk/goldman-srv/prank>
References and links
ORIGINAL RESEARCH PAPER
1. Löytynoja, A. & Goldman, N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320, 1632–1635 (2008)
* Article <http://dx.doi.org/10.1126/science.1158395>
* PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&cmd=Retrieve&db=PubMed&list_uids=18566285&dopt=Abstract>
* ChemPort <http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BD1cXnt1Oju74%3D&pissn=1471-0056&pyear=2008&md5=a79f51b2820cec08c03d9813f15c2349>
------------------------------------------------------------------------
_______________________________________________
Beg-evodevo mailing list
Beg-evodevo@psb.ugent.be
https://maillist.psb.ugent.be/mailman/listinfo/beg-evodevo
--
Yves Van de Peer, PhD.
Professor in Bioinformatics and Genome Biology
Group Leader Bioinformatics and Evolutionary Genomics
VIB Department of Plant Systems Biology, UGent
Ghent University
Technologiepark 927
B-9052 Ghent
Belgium
Phone: +32 (0)9 331 3807
Cell Phone: +32 (0)476 560 091
Fax: +32 (0)9 331 3809
email: yves.vandepeer@psb.ugent.be
http://bioinformatics.psb.ugent.be/
participants (2)
- Klaas Vandepoele
- Yves Van de Peer