Enhancing Distributed Machine Learning through Data Shuffling: Techniques, Challenges, and Implications

Bibliographic Details
Published in: ITM Web of Conferences vol. 73 (2025)
Main author: Zhang, Zikai
Published: EDP Sciences
Subjects: Machine learning; Encoding-Decoding; Complexity; Real time
Online access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3194619533
003 UK-CbPIL
022 |a 2431-7578 
022 |a 2271-2097 
024 7 |a 10.1051/itmconf/20257303018  |2 doi 
035 |a 3194619533 
045 2 |b d20250101  |b d20251231 
084 |a 268430  |2 nlm 
100 1 |a Zhang, Zikai 
245 1 |a Enhancing Distributed Machine Learning through Data Shuffling: Techniques, Challenges, and Implications 
260 |b EDP Sciences  |c 2025 
513 |a Conference Proceedings 
520 3 |a In distributed machine learning, data shuffling is a crucial data preprocessing technique that significantly impacts the efficiency and performance of model training. As distributed machine learning scales across multiple computing nodes, the ability to shuffle data effectively and efficiently has become essential for achieving high-quality model performance and minimizing communication costs. This paper systematically explores various data shuffling methods, including random shuffling, stratified shuffling, K-fold shuffling, and coded shuffling, each with distinct advantages, limitations, and application scenarios. Random shuffling is simple and fast but may lead to imbalanced class distributions, while stratified shuffling maintains class proportions at the cost of increased complexity. K-fold shuffling provides robust model evaluation through multiple training-validation splits, though it is computationally demanding. Coded shuffling, on the other hand, optimizes communication costs in distributed settings but requires sophisticated encoding-decoding techniques. The study also highlights the challenges associated with current shuffling techniques, such as handling class imbalance, high computational complexity, and adapting to dynamic, real-time data. This paper proposes potential solutions to enhance the efficacy of data shuffling, including hybrid methodologies, automated stratification processes, and optimized coding strategies. This work aims to guide future research on data shuffling in distributed machine learning environments, ultimately advancing model robustness and generalization across complex real-world applications. 
653 |a Machine learning 
653 |a Encoding-Decoding 
653 |a Complexity 
653 |a Real time 
773 0 |t ITM Web of Conferences  |g vol. 73 (2025) 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3194619533/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3194619533/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch