Task Parallel Programming on the HammerBlade Manycore

Bibliographic Details
Published in:	ProQuest Dissertations and Theses (2025)
Main Author:	Ruttenberg, Max
Published:	ProQuest Dissertations & Theses
Subjects:	Computer science Computer engineering
Online Access:	Citation/Abstract Full Text - PDF

MARC


LEADER	00000nab a2200000uu 4500
001	3251632227
003	UK-CbPIL
020			\|a 9798293850938
035			\|a 3251632227
045	2		\|b d20250101 \|b d20251231
084			\|a 66569 \|2 nlm
100	1		\|a Ruttenberg, Max
245	1		\|a Task Parallel Programming on the HammerBlade Manycore
260			\|b ProQuest Dissertations & Theses \|c 2025
513			\|a Dissertation/Thesis
520	3		\|a Manycore architectures integrate hundreds of cores on a single chip by using simple cores and simple memory systems, usually based on software-managed scratchpad memories (SPMs). However, such architectures are notoriously challenging to program, since programmers must manually manage all aspects of data movement and synchronization for both correctness and performance. This manycore programmability challenge is one of the key barriers to achieving the promise of manycore architectures. Single program, multiple data (SPMD) is the de facto standard parallel programming paradigm for manycore processors, not because the programming model is simple, but because its overheads are low. By contrast, the dynamic task parallel programming model has enjoyed considerable success in addressing the programmability challenge of multi-core processors with tens of complex cores and a robust, coherent cache memory hierarchy. In this thesis, I focus on the HammerBlade manycore and demonstrate that a work-stealing runtime is not only feasible on manycore architectures with SPMs, but can also significantly improve the performance of irregular workloads executing on these architectures. I also explore optimizations that leverage unused SPM space. This runtime framework achieves speedups of 1.2–28.5× on select workloads while inducing only minimal overheads. I show that this runtime remains scalable up to a thousand-core system. Loss of locality can be mitigated by embedding locality-aware semantics into the scheduler while adding minimal burden on the programmer.
653			\|a Computer science
653			\|a Computer engineering
773	0		\|t ProQuest Dissertations and Theses \|g (2025)
786	0		\|d ProQuest \|t ProQuest Dissertations & Theses Global
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3251632227/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
856	4	0	\|3 Full Text - PDF \|u https://www.proquest.com/docview/3251632227/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch