A Resilience Study of Soft Errors in High-Performance Computing Applications: Visualization and LLM-Based Modeling
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main author: | |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Online access: | Citation/Abstract Full Text - PDF |
| Abstract: | High-performance computing (HPC) drives modern scientific and engineering discovery, yet the ever-growing scale and density of contemporary architectures make them increasingly susceptible to transient bit flips that evade hardware countermeasures. When such soft errors escape detection, they may manifest as silent data corruptions (SDCs), compromising result integrity. Conventional defences (error-correcting codes or checkpoint/restart) struggle to scale gracefully to exascale workloads, leaving a widening resilience gap. This dissertation bridges that gap by combining large-language-model (LLM) analytics with interactive visualisation to create an end-to-end workflow that explains, predicts, and contextualises resilience in real HPC codes. First, the VISILIENCE framework overlays multiple resilience metrics on a program’s control-flow graph, enabling developers to trace error-propagation paths and prioritise mitigation without wading through raw logs. Building on those insights, the modular HAPPA predictor segments long kernels, embeds each segment with Transformer models, and aggregates them via mean, max, LSTM, or attention pooling, achieving state-of-the-art SDC-rate prediction accuracy. Its parameter-efficient successor, eHAPPA, adopts low-rank adaptation to cut trainable parameters by more than 99% while lowering mean-squared prediction error to 0.055, a 25% improvement over prior baselines. Finally, a loop-level study maps 52 benchmark applications to the thirteen “dwarfs” of parallel computation; guided by prompt-engineered GPT-4, it classifies loop semantics automatically and, through large-scale fault injection, uncovers pronounced dwarf-specific vulnerability patterns, with N-Body loops exceeding 30% SDC incidence and MapReduce loops remaining below 5%. Together these contributions form a scalable tool chain that converts low-level fault data into actionable insight, enabling selective hardening and advancing the dependability of next-generation HPC systems. |
| ISBN: | 9798288836077 |
| Source: | ProQuest Dissertations & Theses Global |