On Transparent Optimizations for Communication in Highly Parallel Systems
| Published in: | ProQuest Dissertations and Theses (2024) |
|---|---|
| Main author: | |
| Published: | ProQuest Dissertations & Theses |
| Topics: | |
| Online access: | Citation/Abstract; Full Text - PDF |
Abstract:

To leverage the omnipresent hardware parallelism in modern systems, applications must efficiently communicate across parallel tasks, e.g., to share data or control execution flow. The longstanding mechanisms for shared memory and distributed memory, i.e., coherence and message passing, remain the dominant choices for implementing communication. I argue that these stalwart constructs can be transparently optimized, improving performance without exposing developers to the growing complexity of modern hardware that employs both shared and distributed memory. Then, I explore the ultimate ambition: a unified transparent communication abstraction across all memory types.

In shared memory multiprocessors, communication is performed implicitly. Cache coherence maintains the abstraction of a single shared memory among hardware threads, so the application does not have to explicitly move data between them. However, coherence protocols incur an increasing overhead in modern hardware due to their conservative, reactive policies. I designed a new coherence protocol called WARDen to exploit the novel WARD property, which identifies large regions of memory that do not require fine-grained coherence. By transparently disabling the coherence protocol where it is unneeded, WARDen maintains the abstraction of shared memory and improves application performance by an average of 1.46x.

In distributed memory machines, communication between memory domains is performed explicitly by the application. To specify the necessary communication, collective operations are the predominant primitive because they allow programmers to elegantly specify large-scale communication patterns in a single function call (a minimal illustrative sketch follows this abstract). The Message Passing Interface (MPI) is the de facto standard for collectives in high-performance distributed memory systems such as supercomputers. MPI libraries typically contain 3-4 implementations (i.e., algorithms) for each collective pattern.

Despite their utility, collectives suffer performance degradation due to poor algorithm selection in the underlying MPI library. I created a series of autotuners named FACT and ACCLAiM that use machine learning (ML) to tractably find the optimal collective algorithms for large-scale applications. The autotuners are sometimes limited, however, when all of the available algorithms fail to properly leverage the underlying hardware. To address this issue, I developed a set of more flexible algorithms that can better map to complex, modern networks and increase the potency of autotuning. Combining these efforts on Frontier (the world's fastest supercomputer at the time of writing), I achieve speedups of over 4x compared to the proprietary vendor MPI library.

Lastly, I explored my vision for a higher-level programming model that abstracts away communication altogether. I ported the popular NAS Parallel Benchmark Suite to an FMPL (Functional, Memory-managed, Parallel Language). I found that FMPLs have the potential to drastically improve transparency because the program does not need to be aware of communication at all. However, FMPLs are currently limited to shared memory machines. I built a prototype that extends an FMPL to distributed memory, charting the course to FMPLs in high-performance computing.

Across these research thrusts, I developed novel optimizations for communication in high-performance applications. Together, they show how existing communication abstractions, i.e., shared memory and message passing, can be transparently optimized, maintaining or even improving the level of abstraction exposed to the developer.
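For readers unfamiliar with collectives, here is a minimal sketch (not drawn from the dissertation; the program and values are illustrative) of the single-call communication pattern the abstract describes. `MPI_Allreduce` is a standard MPI call, and the MPI library internally selects among several implementation algorithms for it, which is precisely the selection that autotuners like FACT and ACCLAiM target.

```c
/* Minimal illustration: one collective call expresses a global reduction.
 * Build with an MPI compiler wrapper (e.g., mpicc) and run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one value; the collective combines them all. */
    int local = rank + 1;
    int global_sum = 0;

    /* A single call specifies the whole large-scale pattern. The MPI
     * library chooses the underlying algorithm (e.g., ring or recursive
     * doubling) on the application's behalf. */
    MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum across all ranks = %d\n", global_sum);

    MPI_Finalize();
    return 0;
}
```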
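The abstract does not describe how FACT and ACCLAiM work internally, so the sketch below only illustrates the underlying measurement problem any such autotuner must solve: timing a collective consistently across ranks so candidate algorithms can be compared. The repetition count is an assumption; `MPI_Wtime`, `MPI_Barrier`, and `MPI_Reduce` are standard MPI calls.

```c
/* Hedged sketch: timing a collective so candidate algorithms can be
 * compared. A collective finishes only when its slowest rank does, so
 * the maximum per-rank average is the figure of merit. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 100  /* illustrative repetition count */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank, sum;

    /* Synchronize so every rank enters the timed region together. */
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double avg = (MPI_Wtime() - start) / ITERS;

    /* Report the slowest rank's average as the collective's latency. */
    double worst;
    MPI_Reduce(&avg, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("average allreduce latency: %g seconds\n", worst);

    MPI_Finalize();
    return 0;
}
```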
| ISBN: | 9798381977202 |
|---|---|
| Source: | ProQuest Dissertations & Theses Global |