Optimizing Data Movement Performance and Energy Efficiency in Distributed Systems Under Shared Resource Constraints

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Jamil, Md. Hasibul
Published: ProQuest Dissertations & Theses
Online Access: Citation/Abstract; Full Text - PDF
Description
Abstract: The extensive growth in data-intensive science and industrial analytics has magnified the importance of achieving high-throughput and energy-efficient data movement over heterogeneous networks and compute environments. Existing solutions for data movement often rely on static, one-size-fits-all parameter configurations that cannot adapt to fluctuations in network bandwidth, end-system contention, or filesystem performance demands. Consequently, these approaches either fail to maximize throughput or incur substantial energy overheads.

In our research, we present a family of novel solutions that jointly optimize data movement performance and energy consumption through cross-layer adaptations spanning the application layer, kernel configurations, and runtime environments. First, we propose a two-phase, decision-tree-based framework for uncertainty reduction to optimize throughput and energy efficiency in data transfer applications. Its offline component clusters historical data transfer logs to identify robust application and kernel parameters; subsequently, an online algorithm adapts concurrency, parallelism, CPU core allocation, and frequency scaling based on real-time conditions. This cross-layer solution demonstrates up to 117% higher throughput and 19% lower energy consumption compared to traditional methods.

Recognizing the high cost of gathering environment-specific historical data and the need for a dedicated application-level solution with wide adaptability, we further introduce learning-based approaches that generalize across diverse network conditions without relying on extensive prior historical logs. By incorporating Deep Reinforcement Learning (DRL) and multi-parameter optimization, these frameworks dynamically adjust the number of parallel TCP streams and application-layer concurrency, yielding up to 25% throughput gains and 40% energy savings while converging 40% faster than conventional algorithms. Fairness and congestion avoidance mechanisms are also integrated to maintain stable network performance across competing flows.

Building on these cross-layer, energy-aware principles, we then apply similar concepts to distributed machine learning I/O with Efficient Machine Learning I/O (EMLIO). EMLIO co-locates lightweight daemons on storage nodes to pre-batch and serialize training data shards, moves data over multi-stream TCP/ZeroMQ channels, and integrates seamlessly with GPU-accelerated preprocessing (e.g., NVIDIA DALI). In our evaluations, EMLIO delivers up to 8.6x faster I/O and 10.9x lower energy consumption compared to state-of-the-art ML data loaders, while maintaining constant performance and energy profiles irrespective of network distance.

Beyond bulk data transfers, we investigate end-to-end scientific data streaming under near-real-time constraints. Our NUMA-aware runtime system aligns memory-intensive tasks (e.g., compression) with local memory domains, thereby delivering up to a 1.48x throughput improvement over state-of-the-art methods and a 2.6x speedup over conventional approaches. We also develop FlowTracer, a tool that detects and corrects imbalances in equal-cost multi-path (ECMP) routing within leaf-spine networks, reducing path skew by 30% and alleviating throughput degradation, with a specific focus on AI training workloads.

Collectively, these contributions lay a robust groundwork for multi-objective optimization of data movement and distributed training in shared environments.
By unifying cross-layer decision-tree methods, reinforcement-learning policies, energy-aware I/O services, NUMA-aware runtime designs, and multi-path route monitoring tools, these solutions significantly enhance throughput, reduce energy costs, and maintain fairness in large-scale, heterogeneous workloads.
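
To make the online adaptation idea above concrete, the following minimal sketch shows a greedy hill-climbing controller that tunes application-layer concurrency against a throughput-per-watt score. It is an illustration only, not the dissertation's decision-tree or DRL policy; measure_throughput_gbps and measure_power_watts are hypothetical stand-ins for real NIC and power telemetry.

    import random  # stand-in noise so the sketch runs without real telemetry

    # Hypothetical probes; a real transfer tool would read NIC counters and
    # RAPL/IPMI energy meters instead.
    def measure_throughput_gbps(concurrency):
        return min(2.0 * concurrency, 40.0) * random.uniform(0.9, 1.1)

    def measure_power_watts(concurrency):
        return (80.0 + 6.0 * concurrency) * random.uniform(0.95, 1.05)

    def tune_concurrency(initial=2, max_conc=32, rounds=20):
        """Greedy search on throughput-per-watt, a crude energy-efficiency proxy."""
        conc, step, best = initial, 1, 0.0
        for _ in range(rounds):
            score = measure_throughput_gbps(conc) / measure_power_watts(conc)
            if score < best:
                step = -step            # got worse: reverse the search direction
            best = max(best, score)
            conc = max(1, min(max_conc, conc + step))
        return conc

    if __name__ == "__main__":
        print("chosen concurrency:", tune_concurrency())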
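
The EMLIO pipeline described above (storage-side daemons that pre-batch, serialize, and stream training shards) can be approximated with a single ZeroMQ PUSH/PULL channel. This is a minimal single-stream sketch assuming pyzmq and pickle serialization; the endpoint, framing, and end-of-stream marker are assumptions made for illustration.

    import pickle
    import threading
    import zmq

    ENDPOINT = "tcp://127.0.0.1:5555"        # assumed local endpoint for the sketch

    def storage_daemon(shards):
        """Serialize pre-batched shards on the storage side and push them out."""
        sock = zmq.Context.instance().socket(zmq.PUSH)
        sock.bind(ENDPOINT)
        for shard in shards:
            sock.send(pickle.dumps(shard))   # one message per pre-batched shard
        sock.send(b"")                       # empty frame marks end of stream
        sock.close()

    def trainer():
        """Pull serialized shards and hand them to the training loop."""
        sock = zmq.Context.instance().socket(zmq.PULL)
        sock.connect(ENDPOINT)
        while True:
            msg = sock.recv()
            if not msg:                      # end-of-stream marker
                break
            batch = pickle.loads(msg)
            print("received batch of", len(batch), "samples")
        sock.close()

    if __name__ == "__main__":
        shards = [list(range(i, i + 128)) for i in range(0, 512, 128)]
        sender = threading.Thread(target=storage_daemon, args=(shards,))
        sender.start()
        trainer()
        sender.join()

In practice several such channels would run in parallel (the multi-stream aspect) and the deserialized batches would feed GPU-side preprocessing such as DALI.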
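
One ingredient of the NUMA-aware runtime described above is keeping a memory-intensive worker, such as a compression stage, on CPUs local to a single memory domain. The Linux-only sketch below illustrates that step by reading a node's CPU list from sysfs and calling sched_setaffinity; the actual runtime's placement policy is more sophisticated.

    import os

    def node_cpus(node):
        """Parse /sys/devices/system/node/nodeN/cpulist (e.g. '0-15,32-47')."""
        path = "/sys/devices/system/node/node%d/cpulist" % node
        cpus = set()
        with open(path) as f:
            for part in f.read().strip().split(","):
                if "-" in part:
                    lo, hi = map(int, part.split("-"))
                    cpus.update(range(lo, hi + 1))
                elif part:
                    cpus.add(int(part))
        return cpus

    def pin_to_node(node):
        """Restrict the calling process to CPUs local to the given NUMA node."""
        os.sched_setaffinity(0, node_cpus(node))

    if __name__ == "__main__":
        pin_to_node(0)                       # keep this worker on NUMA node 0
        print("affinity:", sorted(os.sched_getaffinity(0)))
        # ...run the compression stage here so its pages stay node-local...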
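
FlowTracer's goal of detecting ECMP imbalance in a leaf-spine fabric can be illustrated with a toy skew metric: hash each flow's five-tuple onto an uplink and compare the busiest path against the mean. The hash function and skew definition below are illustrative assumptions, not the tool's actual algorithm.

    import hashlib
    from collections import Counter

    def path_index(five_tuple, num_paths):
        """Map a flow five-tuple onto one of num_paths uplinks (toy ECMP hash)."""
        digest = hashlib.sha256(repr(five_tuple).encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_paths

    def path_skew(flows, num_paths):
        """Skew = flows on the busiest path / mean flows per path (1.0 = balanced)."""
        counts = Counter(path_index(ft, num_paths) for ft in flows)
        return max(counts.values()) / (len(flows) / num_paths)

    if __name__ == "__main__":
        # 64 synthetic TCP flows between a pair of racks
        flows = [("10.0.0.%d" % (i % 8), "10.0.1.%d" % (i % 5), 6, 40000 + i, 443)
                 for i in range(64)]
        print("skew over 4 uplinks: %.2f" % path_skew(flows, 4))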
ISBN: 9798293833016
Source: ProQuest Dissertations & Theses Global