Network-Aware Device Placement Search for Distributed Training
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main Author: | Venkata, Vishnu Varma |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Online Access: | Citation/Abstract; Full Text - PDF |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3275489644 | ||
| 003 | UK-CbPIL | ||
| 020 | |a 9798263339616 | ||
| 035 | |a 3275489644 | ||
| 045 | 2 | |b d20250101 |b d20251231 | |
| 084 | |a 66569 |2 nlm | ||
| 100 | 1 | |a Venkata, Vishnu Varma | |
| 245 | 1 | |a Network-Aware Device Placement Search for Distributed Training | |
| 260 | |b ProQuest Dissertations & Theses |c 2025 | ||
| 513 | |a Dissertation/Thesis | ||
| 520 | 3 | |a As deep learning models grow in scale and complexity, efficient distributed training requires not only advanced parallelization strategies but also intelligent placement of model components across heterogeneous computing infrastructures. Existing device placement frameworks often assume simplified, uniform network topologies, leading to suboptimal performance in real-world data centers where communication costs vary significantly across nodes. I present NEST, a network-aware, efficient device placement framework based on structured dynamic programming techniques. NEST jointly optimizes device placement and parallelism configuration by explicitly modeling the hierarchical and oversubscribed nature of modern data center networks. It supports a broad range of parallelization strategies, including tensor, pipeline, data, expert, and Zero Redundancy Optimizer (ZeRO) parallelism, and integrates detailed memory and communication cost modeling. Through structured dynamic programming, NEST explores the vast placement space efficiently and offers provable optimality guarantees within its search scope. Evaluations across realistic workloads and network settings show that NEST consistently outperforms manual and network-unaware baselines, delivering significant improvements in training throughput and resource utilization. | |
| 653 | |a Scheduling | ||
| 653 | |a Computer centers | ||
| 653 | |a Deep learning | ||
| 653 | |a Dynamic programming | ||
| 653 | |a Network topologies | ||
| 653 | |a Infrastructure | ||
| 653 | |a Graphs | ||
| 653 | |a Costs | ||
| 653 | |a Communication | ||
| 653 | |a Bandwidths | ||
| 653 | |a Optimization techniques | ||
| 653 | |a Linear programming | ||
| 653 | |a Workloads | ||
| 653 | |a Energy consumption | ||
| 653 | |a Markov analysis | ||
| 653 | |a Artificial intelligence | ||
| 653 | |a Computer science | ||
| 653 | |a Economics | ||
| 653 | |a Operations research | ||
| 773 | 0 | |t ProQuest Dissertations and Theses |g (2025) | |
| 786 | 0 | |d ProQuest |t ProQuest Dissertations & Theses Global | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3275489644/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text - PDF |u https://www.proquest.com/docview/3275489644/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch |