Efficient Scheduling for GPU-Based Neural Network Training via Hybrid Reinforcement Learning and Metaheuristic Optimization

Bibliographic Details
Published in: Big Data and Cognitive Computing vol. 9, no. 11 (2025), p. 284-325
Main Author: Du, Nana
Other Authors: Wu, Chase, Hou Aiqin, Nie Weike, Song Ruiqi
Published:
MDPI AG
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3275500653
003 UK-CbPIL
022 |a 2504-2289 
024 7 |a 10.3390/bdcc9110284  |2 doi 
035 |a 3275500653 
045 2 |b d20250101  |b d20251231 
100 1 |a Du, Nana  |u School of Computer, Northwest University, Xi’an 710100, China; dunana@stumail.nwu.edu.cn (N.D.); ruiqi_song@stumail.nwu.edu.cn (R.S.) 
245 1 |a Efficient Scheduling for GPU-Based Neural Network Training via Hybrid Reinforcement Learning and Metaheuristic Optimization 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a On GPU-based clusters, the training workloads of machine learning (ML) models, particularly neural networks (NNs), are often structured as Directed Acyclic Graphs (DAGs) and typically deployed for parallel execution across heterogeneous GPU resources. Efficient scheduling of these workloads is crucial for optimizing performance metrics such as execution time, under various constraints including GPU heterogeneity, network capacity, and data dependencies. DAG-structured ML workload scheduling could be modeled as a Nonlinear Integer Program (NIP) problem, and is shown to be NP-complete. By leveraging a positive correlation between Scheduling Plan Distance (SPD) and Finish Time Gap (FTG) identified through an empirical study, we propose to develop a Running Time Gap Strategy for scheduling based on Whale Optimization Algorithm (WOA) and Reinforcement Learning, referred to as WORL-RTGS. The proposed method integrates the global search capabilities of WOA with the adaptive decision-making of Double Deep Q-Networks (DDQN). Particularly, we derive a novel function to generate effective scheduling plans using DDQN, enhancing adaptability to complex DAG structures. Comprehensive evaluations on practical ML workload traces collected from Alibaba on simulated GPU-enabled platforms demonstrate that WORL-RTGS significantly improves WOA’s stability for DAG-structured ML workload scheduling and reduces completion time by up to 66.56% compared with five state-of-the-art scheduling algorithms. 
653 |a Scheduling 
653 |a Performance measurement 
653 |a Neural networks 
653 |a Integer programming 
653 |a Costs 
653 |a Graph theory 
653 |a Optimization 
653 |a Decision making 
653 |a Workload 
653 |a Algorithms 
653 |a Quality of service 
653 |a Machine learning 
653 |a Heuristic 
653 |a Completion time 
653 |a Workloads 
653 |a Heterogeneity 
653 |a Heuristic methods 
653 |a Run time (computers) 
700 1 |a Wu, Chase  |u Department of Data Science, New Jersey Institute of Technology, Newark, NJ 07102, USA; chase.wu@njit.edu 
700 1 |a Hou Aiqin  |u School of Computer, Northwest University, Xi’an 710100, China; dunana@stumail.nwu.edu.cn (N.D.); ruiqi_song@stumail.nwu.edu.cn (R.S.) 
700 1 |a Nie Weike  |u School of Computer, Northwest University, Xi’an 710100, China; dunana@stumail.nwu.edu.cn (N.D.); ruiqi_song@stumail.nwu.edu.cn (R.S.) 
700 1 |a Song Ruiqi  |u School of Computer, Northwest University, Xi’an 710100, China; dunana@stumail.nwu.edu.cn (N.D.); ruiqi_song@stumail.nwu.edu.cn (R.S.) 
773 0 |t Big Data and Cognitive Computing  |g vol. 9, no. 11 (2025), p. 284-325 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3275500653/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3275500653/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3275500653/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch
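
The abstract in field 520 treats completion time of a DAG-structured workload on heterogeneous GPUs as the quantity a scheduling plan should minimize. As a rough illustration only, and not the WORL-RTGS method itself, the sketch below evaluates the completion time of one fixed plan. Everything in it is an assumption made for the example: the function name `makespan`, the task and GPU names, the per-GPU `speed` factors, running tasks on each GPU in topological order, and ignoring network transfer time.

```python
from collections import deque


def makespan(durations, deps, assignment, speed):
    """Completion time of a DAG workload under a fixed task-to-GPU plan.

    durations:  {task: amount of work}            (hypothetical units)
    deps:       {task: [predecessor tasks]}
    assignment: {task: gpu}                       (the scheduling plan)
    speed:      {gpu: work units per second}      (models GPU heterogeneity)
    """
    # Kahn's algorithm: visit tasks in a topological order of the DAG.
    indegree = {t: len(deps.get(t, [])) for t in durations}
    children = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            children[p].append(task)

    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    gpu_free = {g: 0.0 for g in speed}  # time at which each GPU becomes idle
    finish = {}

    while ready:
        task = ready.popleft()
        gpu = assignment[task]
        # A task starts once its GPU is idle and all predecessors are done.
        start = max([gpu_free[gpu]] + [finish[p] for p in deps.get(task, [])])
        finish[task] = start + durations[task] / speed[gpu]
        gpu_free[gpu] = finish[task]
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(finish) != len(durations):
        raise ValueError("dependency graph contains a cycle")
    return max(finish.values())


# Diamond-shaped DAG: a -> (b, c) -> d, with gpu1 twice as fast as gpu0.
durations = {"a": 4, "b": 2, "c": 2, "d": 4}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
speed = {"gpu0": 1.0, "gpu1": 2.0}

all_on_gpu0 = {t: "gpu0" for t in durations}
mixed = {"a": "gpu1", "b": "gpu0", "c": "gpu1", "d": "gpu1"}

print(makespan(durations, deps, all_on_gpu0, speed))  # 12.0
print(makespan(durations, deps, mixed, speed))        # 6.0
```

A search procedure such as the one the abstract describes would compare many candidate plans by a cost of this kind; here only the evaluation step is shown, and the two plans are hand-picked to show how the assignment changes the result.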