Bridging Fault Tolerance, Time Synchronization, and Performance Understanding Across Scalable Architectures and Applications


Bibliographic Details
Published in:	ProQuest Dissertations and Theses (2025)
Main Author:	Nansamba, Grace
Published:	ProQuest Dissertations & Theses
Subjects:	High temperature physics; Computer engineering; Computer science
Online Access:	Citation/Abstract; Full Text - PDF

MARC


LEADER	00000nab a2200000uu 4500
001	3285411607
003	UK-CbPIL
020			\|a 9798270238964
035			\|a 3285411607
045	2		\|b d20250101 \|b d20251231
084			\|a 66569 \|2 nlm
100	1		\|a Nansamba, Grace
245	1		\|a Bridging Fault Tolerance, Time Synchronization, and Performance Understanding Across Scalable Architectures and Applications
260			\|b ProQuest Dissertations & Theses \|c 2025
513			\|a Dissertation/Thesis
520	3		\|a High Performance Computing (HPC) has become the backbone of scientific discovery and engineering innovation, yet increasing system complexity and scale amplify challenges in fault tolerance, time synchronization, and performance understanding. This dissertation presents an integrated study that bridges these three critical dimensions to enhance resilience, coordination, and efficiency across scalable architectures and applications. Fault tolerance for heterogeneous HPC systems is explored through CUDA kernel interruption using NVIDIA Multi-Process Service (MPS) and POSIX threads, as well as transparent checkpoint/restart with Application Binary Interface (ABI) portability. A taxonomy of consensus mechanisms adapted to MPI and HPC is introduced, classifying synchronous, asynchronous, and partially synchronous models, and aligning them with crash and Byzantine fault models. Time synchronization is addressed through the adaptation of the HUYGENS algorithm for HPC, yielding a lightweight, software-based clock correction method that requires no specialized hardware. The HPC-HUYGENS approach employs ring-based timestamp probes and Support Vector Machine (SVM) classification to minimize skew and drift. Experimental evaluation demonstrates that this method improves collective predictability and reduces synchronization overhead compared to the traditional MPI_Barrier for reductions. Performance understanding is advanced through fine-grain profiling of communication patterns using Caliper, Benchpark, and Thicket. By instrumenting halo exchanges and collective regions in representative benchmarks (AMG2023, Kripke, Laghos), communication bottlenecks are identified, scaling inefficiencies quantified, and performance trade-offs assessed across architectures such as Intel Sapphire Rapids (Dane) and AMD MI250X (Tioga). The findings establish a conceptual framework for resilient and efficient large-scale HPC.
By bridging fault tolerance, time synchronization, and performance analysis, this work demonstrates that resilience, coordination/synchronization, and efficiency are not isolated goals but mutually reinforcing pillars for next-generation HPC systems.
653			\|a High temperature physics
653			\|a Computer engineering
653			\|a Computer science
773	0		\|t ProQuest Dissertations and Theses \|g (2025)
786	0		\|d ProQuest \|t ProQuest Dissertations & Theses Global
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3285411607/abstract/embedded/160PP4OP4BJVV2EV?source=fedsrch
856	4	0	\|3 Full Text - PDF \|u https://www.proquest.com/docview/3285411607/fulltextPDF/embedded/160PP4OP4BJVV2EV?source=fedsrch