Detail view

Computational methods for fast and accurate phylogenetic inference

Lam Tung Nguyen

Type of thesis

Dissertation

University

University of Vienna

Faculty

Faculty of Life Sciences

Degree programme or university course (ULG)

Doctor of Philosophy doctoral programme in Natural Sciences, field of Life Sciences (dissertation area: Molecular Biology)

Supervisor

Arndt von Haeseler

DOI

10.25365/thesis.41036

URN

urn:nbn:at:at-ubw:1-29047.89931.719253-8

Abstracts

Abstract

(German)

The theory of evolution put forward by Charles Darwin laid the foundation of modern biological research. Reconstructing the phylogeny (evolutionary tree) from molecular data plays the central role in elucidating evolutionary relationships. Thanks to inexpensive sequencing technologies, vast amounts of biological data are produced every year. To keep pace with the speed of data generation, tree reconstruction methods must be regularly adapted and improved. This thesis deals with the development of efficient and accurate algorithms for reconstructing phylogenies using the maximum-likelihood method. First, we present a fast and effective stochastic algorithm for estimating the maximum-likelihood tree. We show that a combination of hill climbing, randomization and broad sampling of the search space leads to a remarkable improvement in tree search. For most of the data sets tested, our software IQ-TREE found better trees than the two widely known programs RAxML and PhyML. Second, we show that commonly used phylogenetic software cannot reliably estimate the parameters of the popular evolutionary model +I+Γ. We found that the insufficient accuracy of the implemented optimization routines is the cause of the problem. We therefore propose an alternative optimization strategy that considerably improves the accuracy of the estimates. Our finding underlines the importance of developing suitable estimation procedures, especially as increasingly complex evolutionary models come into use. Third, we extend the IQ-TREE search algorithm with an efficient speed-up heuristic. Here we use a data-driven approach to identify stable tree structures. This allows us to restrict the search space and reduce the search time by a factor of up to 3.9. Finally, we present an efficient MPI parallelization of the IQ-TREE search algorithm. All methods are implemented in the software IQ-TREE. IQ-TREE is available at the following website: http://www.cibiv.at/software/iqtree.

Abstract

(English)

The theory of evolution, first popularized by Charles Darwin, laid the foundation of modern biological research. Reconstructing a phylogeny (evolutionary tree) from molecular data is one approach to understanding evolutionary relationships. The advent of cheap, high-throughput sequencing technologies has led to an explosion in the number of genetic sequences. To keep up with the speed of data generation, tree reconstruction methods need to be constantly improved. This thesis presents fast and accurate methods for inferring maximum-likelihood phylogenies. First, we describe a fast and effective stochastic search algorithm to find maximum-likelihood phylogenies. Our algorithm, implemented in the phylogenetic software IQ-TREE, employs hill-climbing search, stochastic perturbation and an evolution strategy to sample local optima in the tree space. IQ-TREE performs favorably compared to state-of-the-art methods such as RAxML and PhyML. Given the same CPU time as RAxML and PhyML, IQ-TREE found higher likelihoods for between 62.2% and 87.1% of the studied alignments, thus exploring the tree space efficiently. If we use the IQ-TREE stopping rule, RAxML and PhyML are faster in 75.7% and 47.1% of the DNA alignments and 42.2% and 100% of the protein alignments, respectively. However, the proportion of alignments for which IQ-TREE obtains higher likelihoods then rises to 73.3–97.1%. Second, we show that popular phylogenetic inference software cannot reliably estimate the parameters of the widely used +I+Γ model of rate heterogeneity in sequence evolution. The inability to infer the true parameters is caused by inaccurate numerical optimization routines implemented in these programs. Hence, we propose an alternative optimization strategy to improve the accuracy of estimates for the +I+Γ model. As more and more complex models of sequence evolution are being developed, our finding emphasizes that developing suitable estimation methods is equally important. Third, we present the data-driven heuristic IQ-TREE-SP to shorten the tree search time. IQ-TREE-SP infers stable tree structures from the generated locally optimal trees to constrain the search space. Our computational results show that IQ-TREE-SP is up to 3.9 times faster than IQ-TREE, while at the same time producing better results. Finally, we present an MPI parallelization of the IQ-TREE search algorithm which exhibits very good scaling performance. The described methods are implemented in the software IQ-TREE, available at: http://www.cibiv.at/software/iqtree.

Authors

Lam Tung Nguyen

Main title (English)

Computational methods for fast and accurate phylogenetic inference

Year of publication

2016

Extent

VIII, 91 pages : illustrations, diagrams

Language

English

Reviewers

Peter Arndt,

Rolf Backofen

Classifications

42 Biology > 42.13 Molecular biology,

42 Biology > 42.21 Evolution,

54 Computer science > 54.25 Parallel data processing,

54 Computer science > 54.72 Artificial intelligence,

54 Computer science > 54.76 Computer simulation,

54 Computer science > 54.80 Applied computer science,

54 Computer science > 54.81 Application software

AC number

AC13059328

Utheses ID

36328

Degree programme code

UA | 794 | 685 | 490 |
