Detail view

Scalability and fault tolerance of ABFT methods for dense matrix multiplication

Svetoslav Inkolov

Type of thesis

Master's thesis

University

Universität Wien

Faculty

Fakultät für Informatik

Degree programme or university continuing-education course (ULG)

Master's programme Scientific Computing

Supervisor

Wilfried Gansterer

DOI

10.25365/thesis.51789

URN

urn:nbn:at:at-ubw:1-30962.56692.269863-5

Abstracts

Abstract

(German, translated into English)

The idea of algorithm-based fault tolerance (ABFT) is not new; it originated in the early 1980s. The technique is applied to matrix computations, which form the basis of many computationally intensive tasks. As supercomputers consist of more and more components, their overall complexity increases, and many challenges arise that must be overcome. Consequently, the need for comprehensive error detection and error correction algorithms has grown in recent years. This precarious situation is mainly due to the loss of stability that occurs when many hardware components come together in one system. In a small system or an average supercomputer, hardware parts are reliable enough even over a long period (months or years). Current supercomputers (petaflop range), on the other hand, have tens of thousands of compute nodes, with a mean time to interrupt (MTTI) of about one day. Extending the calculation to an exaflop-scale system (the next generation of supercomputers) would result in an MTTI of about one hour. Since exascale platforms may contain millions of nodes, the possibilities and scenarios of a system failure should be tested thoroughly before such systems go into operation. This thesis investigates how efficiently and reliably ABFT methods for dense matrices can be implemented, and estimates their behavior as systems approach exascale. The investigation concentrates on the local ABFT method, in which a general matrix multiplication (MM) is performed and checked with bit flips inserted during and after the MM. DPLASMA, a highly optimized library for distributed hybrid systems, served as the essential basis for the results. The result is that a local ABFT algorithm should be used in future supercomputers. Another part of this thesis concentrates on simulators. Simulators exist today that can represent more than 100,000 compute nodes with several million processors on a system consisting of only a few dozen nodes. Of course, not all simulators can simulate every possible situation, so the study focuses on summarizing their advantages and disadvantages in the context of high-performance computing (HPC).

Abstract

(English)

The idea of algorithm-based fault tolerance (ABFT) is not new; it has its origins in the early 1980s. This technique is used in computations with matrices, which form the basis of many computationally intensive tasks. As supercomputers consist of more and more components, their overall complexity increases, and many challenges arise that need to be handled. Therefore, the need for exhaustive fault detection and error correction algorithms has become increasingly important over the last few years. This precarious situation is mainly due to the loss of stability when many hardware components come together in one system. In a small system or an average supercomputer, hardware parts are reliable enough even over a long period of time (months or years). Current supercomputers (petaflop range), on the other hand, have tens of thousands of computational nodes with a mean time to interrupt (MTTI) of about one day. If we extend the calculation to a system at exaflop scale (next-generation supercomputers), the result is an MTTI of about one hour. Since exascale platforms can contain millions of nodes, the possibilities and scenarios of failure should be thoroughly tested before such systems are put into operation. The focus of this thesis is to examine how efficiently and reliably ABFT methods can be implemented for dense matrix operations and to estimate their behavior as systems approach exascale. The investigation concentrates on the Local ABFT method, in which a general matrix multiplication (GEMM) is performed and tested against bit flips inserted during and after the GEMM. DPLASMA, a highly optimized library for distributed hybrid systems, was used as an essential basis for the results. We conclude that a Local ABFT algorithm should be used in future supercomputers. Another part of this thesis concentrates on simulators. Simulators now exist that can represent more than 100,000 computational nodes with several million processors on a system consisting of only a few dozen nodes. Of course, not all simulators are capable of simulating all possible situations, so the study focuses on summarizing their benefits and drawbacks in the context of high-performance computing (HPC).
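
The checksum idea behind ABFT for matrix multiplication can be illustrated with a small sketch. The code below is not the thesis's Local ABFT implementation and does not use DPLASMA; it is a minimal, single-process illustration of the classic row/column checksum scheme, assuming at most one corrupted entry in the data part of the result. All function names (`with_column_checksum`, `with_row_checksum`, `flip_bit`, `locate_error`, `correct_error`) and the tolerance `tol` are hypothetical and chosen only for this example.

```python
import struct

def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def with_column_checksum(a):
    """Return a copy of a with an extra row holding each column sum."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Return a copy of b with an extra column holding each row sum."""
    return [row + [sum(row)] for row in b]

def flip_bit(x, bit):
    """Flip one bit (0 = least significant) of a 64-bit float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def locate_error(cf, tol=1e-9):
    """Return (row, col) of a single corrupted data entry, or None.

    cf is the full-checksum product: its last row and last column
    hold the checksums of the data block in the upper-left corner.
    """
    m, n = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(cf[i][:n]) - cf[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(cf[i][j] for i in range(m)) - cf[m][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

def correct_error(cf, pos):
    """Rebuild the corrupted entry from its (intact) row checksum."""
    i, j = pos
    n = len(cf[0]) - 1
    cf[i][j] = cf[i][n] - sum(cf[i][k] for k in range(n) if k != j)

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

# Multiplying the checksum-extended operands yields C with checksums attached.
C_full = matmul(with_column_checksum(A), with_row_checksum(B))

# Inject a fault after the multiplication: flip the lowest exponent bit.
C_full[0][1] = flip_bit(C_full[0][1], 52)   # 22.0 becomes 11.0

pos = locate_error(C_full)
print(pos)                                  # (0, 1)
correct_error(C_full, pos)
print(C_full[0][1])                         # 22.0
```

In this sketch the checksums are carried through the multiplication itself, so a mismatch in exactly one row and one column pinpoints the faulty entry. A flip in a low-order mantissa bit can fall below `tol` and go undetected, and a fault that hits a checksum entry or more than one data entry is not handled here.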

Authors

Svetoslav Inkolov

Main title (English)

Scalability and fault tolerance of ABFT methods for dense matrix multiplication

Parallel title (German)

Skalierbarkeit und Fehlertoleranz von ABFT-Methoden für die dichtbesetzte Matrixmultiplikation

Year of publication

2018

Extent

XV, 126 pages: illustrations, diagrams

Language

English

Assessor

Wilfried Gansterer

Classifications

31 Mathematics > 31.25 Linear algebra, multilinear algebra

31 Mathematics > 31.76 Numerical mathematics

54 Computer science > 54.25 Parallel data processing

54 Computer science > 54.80 Applied computer science

AC number

AC15422180

Utheses ID

45749

Programme code

UA 066 940