Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Saved in:
Bibliographic Details
Published in: arXiv.org (Apr 22, 2022), p. n/a
Main Author: Zhao, Mark
Other Authors: Agarwal, Niket, Basant, Aarti, Gedik, Bugra, Pan, Satadru, Ozdal, Mustafa, Komuravelli, Rakesh, Pan, Jerry, Bao, Tianshu, Lu, Haowei, Narayanan, Sundaram, Langman, Jack, Wilfong, Kevin, Rastogi, Harsha, Wu, Carole-Jean, Kozyrakis, Christos, Pol, Parik
Published:
Cornell University Library, arXiv.org
Subjects: Preprocessing, Ingestion, Machine learning, Storage, Training, Pipelines, Nodes, Data warehouses
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2563974412
003 UK-CbPIL
022 |a 2331-8422 
024 7 |a 10.1145/3470496.3533044  |2 doi 
035 |a 2563974412 
045 0 |b d20220422 
100 1 |a Zhao, Mark 
245 1 |a Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training 
260 |b Cornell University Library, arXiv.org  |c Apr 22, 2022 
513 |a Working Paper 
520 3 |a Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly-complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure. 
653 |a Preprocessing 
653 |a Ingestion 
653 |a Machine learning 
653 |a Storage 
653 |a Training 
653 |a Pipelines 
653 |a Nodes 
653 |a Data warehouses 
700 1 |a Agarwal, Niket 
700 1 |a Basant, Aarti 
700 1 |a Gedik, Bugra 
700 1 |a Pan, Satadru 
700 1 |a Ozdal, Mustafa 
700 1 |a Komuravelli, Rakesh 
700 1 |a Pan, Jerry 
700 1 |a Bao, Tianshu 
700 1 |a Lu, Haowei 
700 1 |a Narayanan, Sundaram 
700 1 |a Langman, Jack 
700 1 |a Wilfong, Kevin 
700 1 |a Rastogi, Harsha 
700 1 |a Wu, Carole-Jean 
700 1 |a Kozyrakis, Christos 
700 1 |a Pol, Parik 
773 0 |t arXiv.org  |g (Apr 22, 2022), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2563974412/abstract/embedded/75I98GEZK8WCJMPQ?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2108.09373
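
Note: the abstract above describes a DSI pipeline in which a central data warehouse feeds a preprocessing service that filters and transforms samples before they reach the trainers. The minimal Python sketch below illustrates only that general shape (warehouse scan, per-job feature filtering, batching); every function, field name, and value in it is an illustrative assumption, not the paper's implementation or Meta's API.

    # Hypothetical sketch of an online data storage and ingestion (DSI) path:
    # a preprocessing worker streams rows from a mock data warehouse, keeps only
    # the features a given training job reads, and yields batches to the trainer.
    # If this stage cannot keep up with the accelerators, training sees data stalls.
    from typing import Dict, Iterator, List

    Row = Dict[str, float]

    def read_from_warehouse(num_rows: int) -> Iterator[Row]:
        """Stand-in for scanning a partitioned table in distributed storage."""
        for i in range(num_rows):
            yield {"click": float(i % 2), "f_user_age": 30.0 + i % 5,
                   "f_item_price": 9.99, "f_unused": 1.0}

    def preprocess(rows: Iterator[Row], keep: List[str],
                   batch_size: int) -> Iterator[List[Row]]:
        """Filter each row to the features this job uses, then batch them."""
        batch: List[Row] = []
        for row in rows:
            batch.append({k: row[k] for k in keep})
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    if __name__ == "__main__":
        for batch in preprocess(read_from_warehouse(10),
                                keep=["click", "f_user_age"], batch_size=4):
            print(batch)  # a real job would hand these to the training loop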