Near Data Processing for Data-Intensive Machine Learning Workloads

Saved in:
Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Wang, Yitu
Published:
ProQuest Dissertations & Theses
Subjects: Computer engineering; Computer science; Electrical engineering
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3205822983
003 UK-CbPIL
020 |a 9798315715719 
035 |a 3205822983 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Wang, Yitu 
245 1 |a Near Data Processing for Data-Intensive Machine Learning Workloads 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a The rapid growth of data-driven applications—spanning personalized recommendation, large-scale machine learning, federated learning, and vector databases—places unprecedented pressure on traditional computer architectures. Conventional systems struggle under the intense memory bandwidth demands and irregular data access patterns of modern workloads. Moreover, limited concurrency in memory hierarchies, high off-chip data movement costs, and the difficulty of scaling to massive, heterogeneous deployments all contribute to performance bottlenecks and energy inefficiencies. These challenges are further amplified by the need for real-time or near-real-time processing, as well as the emerging requirement for large-capacity and high-throughput approximate nearest neighbor search (ANNS) in retrieval-augmented generation for large language models (LLMs). Collectively, these trends expose the critical limitations of existing architectures, revealing a clear need for near-data processing (NDP) solutions that bring computation closer to where data resides.
In this dissertation, Near Data Processing for Data-Intensive Machine Learning Workloads, I present a suite of hardware–software co-design techniques to address these architectural challenges and optimize performance and energy efficiency for data-intensive tasks. First, I introduce ReRec, a ReRAM-based accelerator specialized for sparse embedding lookups in recommendation models. By refining crossbar designs and incorporating an access-aware mapping algorithm, ReRec achieves up to a 29.26× throughput improvement over CPU baselines while mitigating latency and resource under-utilization in fine-grained operations. Next, to tackle the significant gap between compute performance and data transfer speeds, I propose ICGMM, a hardware-driven cache management framework based on Gaussian Mixture Models for Compute Express Link (CXL)-based memory expansion. Prototyped on an FPGA, ICGMM dramatically reduces cache miss rates and average access latency compared to traditional caching policies, with markedly lower hardware overhead.
Building on these insights, I develop EMS-I, an efficient memory system that integrates SSDs via CXL for large-scale recommendation models, such as DLRMs. By tailoring caching and prefetching mechanisms to data access patterns, EMS-I reduces memory costs while delivering performance comparable to state-of-the-art NDP solutions—at substantially lower energy consumption. Beyond recommendation tasks, I address the scalability and heterogeneity issues in federated learning through FedRepre, a framework that accelerates global model convergence via a bi-level active client selection strategy. Enhanced by a specialized server architecture and unified CXL-based memory pool, FedRepre reduces training time by up to 19.54× while improving model accuracy under real-world FL constraints.
I also extend the co-design philosophy to NDSearch, an ANNS solution critical for vector databases and retrieval-augmented generation in LLMs. By leveraging a near-data processing architecture within NAND flash, NDSearch exploits internal parallelism to achieve speedups exceeding 30× over CPU baselines, alongside orders-of-magnitude gains in energy efficiency.
Collectively, these projects illustrate how holistic hardware–software co-design and near-data processing strategies—encompassing ReRAM accelerators, in-storage computing, and CXL-based memory systems—can overcome the persistent challenges of bandwidth-intensive, latency-sensitive, and large-scale machine learning applications. This work provides a promising roadmap toward faster, more efficient, and more scalable computing systems in an era of ever-growing data demands. 
653 |a Computer engineering 
653 |a Computer science 
653 |a Electrical engineering 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3205822983/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3205822983/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch