Network-Aware Device Placement Search for Distributed Training

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Venkata, Vishnu Varma
Published: ProQuest Dissertations & Theses
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3275489644
003 UK-CbPIL
020 |a 9798263339616 
035 |a 3275489644 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Venkata, Vishnu Varma 
245 1 |a Network-Aware Device Placement Search for Distributed Training 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a As deep learning models grow in scale and complexity, efficient distributed training requires not only advanced parallelization strategies but also intelligent placement of model components across heterogeneous computing infrastructures. Existing device placement frameworks often assume simplified, uniform network topologies, leading to suboptimal performance in real-world data centers where communication costs vary significantly across nodes. I present NEST, a network-aware, efficient device placement framework based on structured dynamic programming techniques. NEST jointly optimizes device placement and parallelism configuration by explicitly modeling the hierarchical and oversubscribed nature of modern data center networks. It supports a broad range of parallelization strategies, including tensor, pipeline, data, expert, and Zero Redundancy Optimizer (ZeRO) parallelism, and integrates detailed memory and communication cost modeling. Through structured dynamic programming, NEST explores the vast placement space efficiently and offers provable optimality guarantees within its search scope. Evaluations across realistic workloads and network settings show that NEST consistently outperforms manual and network-unaware baselines, delivering significant improvements in training throughput and resource utilization. 
653 |a Scheduling 
653 |a Computer centers 
653 |a Deep learning 
653 |a Dynamic programming 
653 |a Network topologies 
653 |a Infrastructure 
653 |a Graphs 
653 |a Costs 
653 |a Communication 
653 |a Bandwidths 
653 |a Optimization techniques 
653 |a Linear programming 
653 |a Workloads 
653 |a Energy consumption 
653 |a Markov analysis 
653 |a Artificial intelligence 
653 |a Computer science 
653 |a Economics 
653 |a Operations research 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3275489644/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3275489644/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch