Bridging Fault Tolerance, Time Synchronization, and Performance Understanding Across Scalable Architectures and Applications

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Nansamba, Grace
Published:
ProQuest Dissertations & Theses
Subjects:
Online access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3285411607
003 UK-CbPIL
020 |a 9798270238964 
035 |a 3285411607 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Nansamba, Grace 
245 1 |a Bridging Fault Tolerance, Time Synchronization, and Performance Understanding Across Scalable Architectures and Applications 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a High Performance Computing (HPC) has become the backbone of scientific discovery and engineering innovation, yet increasing system complexity and scale amplify challenges in fault tolerance, time synchronization, and performance understanding. This dissertation presents an integrated study that bridges these three critical dimensions to enhance resilience, coordination, and efficiency across scalable architectures and applications. Fault tolerance for heterogeneous HPC systems is explored through CUDA kernel interruption using NVIDIA Multi-Process Service (MPS) and POSIX threads, as well as transparent checkpoint/restart with Application Binary Interface (ABI) portability. A taxonomy of consensus mechanisms adapted to MPI and HPC is introduced, classifying synchronous, asynchronous, and partially synchronous models, and aligning them with crash and Byzantine fault models. Time synchronization is addressed through the adaptation of the HUYGENS algorithm for HPC, yielding a lightweight, software-based clock correction method that requires no specialized hardware. The HPC-HUYGENS approach employs ring-based timestamp probes and Support Vector Machine (SVM) classification to minimize skew and drift. Experimental evaluation demonstrates that this method improves collective predictability and reduces synchronization overhead compared to the traditional MPI_Barrier for reductions. Performance understanding is advanced through fine-grain profiling of communication patterns using Caliper, Benchpark, and Thicket. By instrumenting halo exchanges and collective regions in representative benchmarks (AMG2023, Kripke, Laghos), communication bottlenecks are identified, scaling inefficiencies quantified, and performance trade-offs assessed across architectures such as Intel Sapphire Rapids (Dane) and AMD MI250X (Tioga). The findings establish a conceptual framework for resilient and efficient large-scale HPC. 
By bridging fault tolerance, time synchronization, and performance analysis, this work demonstrates that resilience, coordination/synchronization, and efficiency are not isolated goals but mutually reinforcing pillars for next-generation HPC systems. 
653 |a High temperature physics 
653 |a Computer engineering 
653 |a Computer science 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3285411607/abstract/embedded/160PP4OP4BJVV2EV?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3285411607/fulltextPDF/embedded/160PP4OP4BJVV2EV?source=fedsrch