Task Parallel Programming on the HammerBlade Manycore

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Ruttenberg, Max
Published: ProQuest Dissertations & Theses
Subjects: Computer science; Computer engineering
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3251632227
003 UK-CbPIL
020 |a 9798293850938 
035 |a 3251632227 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Ruttenberg, Max 
245 1 |a Task Parallel Programming on the HammerBlade Manycore 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a Manycore architectures integrate hundreds of cores on a single chip by using simple cores and simple memory systems, usually based on software-managed scratchpad memories (SPMs). However, such architectures are notoriously challenging to program, since programmers must manually manage all aspects of data movement and synchronization for both correctness and performance. This manycore programmability challenge is one of the key barriers to achieving the promise of manycore architectures. Single program, multiple data (SPMD) is the de facto standard parallel programming paradigm for manycore processors, not because the programming model is simple, but because its overheads are low. By contrast, the dynamic task parallel programming model has enjoyed considerable success in addressing the programmability challenge of multi-core processors with tens of complex cores and a robust, coherent cache hierarchy. In this thesis, I focus on the HammerBlade manycore and demonstrate that a work-stealing runtime is not only feasible on manycore architectures with SPMs, but can also significantly improve the performance of irregular workloads executing on these architectures. I also explore optimizations that leverage unused SPM space. This runtime framework achieves speedups of 1.2–28.5× on select workloads while inducing only minimal overheads. I show that this runtime remains scalable up to a thousand-core system. Loss of locality can be mitigated by embedding locality-aware semantics in the scheduler while placing minimal burden on the programmer. 
653 |a Computer science 
653 |a Computer engineering 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3251632227/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3251632227/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch