INTELLECT-1 Technical Report

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main Author: Jaghouar, Sami
Other Authors: Ong, Jack Min, Basra, Manveer, Obeid, Fares, Straube, Jannik, Keiblinger, Michael, Bakouch, Elie, Atkins, Lucas, Panahi, Maziyar, Goddard, Charles, Ryabinin, Max, Hagemann, Johannes
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3138995643
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3138995643 
045 0 |b d20241202 
100 1 |a Jaghouar, Sami 
245 1 |a INTELLECT-1 Technical Report 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources. 
653 |a Communication 
653 |a Scale models 
653 |a Fault tolerance 
653 |a Nodes 
700 1 |a Ong, Jack Min 
700 1 |a Basra, Manveer 
700 1 |a Obeid, Fares 
700 1 |a Straube, Jannik 
700 1 |a Keiblinger, Michael 
700 1 |a Bakouch, Elie 
700 1 |a Atkins, Lucas 
700 1 |a Panahi, Maziyar 
700 1 |a Goddard, Charles 
700 1 |a Ryabinin, Max 
700 1 |a Hagemann, Johannes 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3138995643/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.01152
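
Note: the abstract in field 520 attributes the reported 400x communication reduction to combining DiLoCo-style infrequent synchronization with a custom int8 all-reduce. The Python sketch below only illustrates the arithmetic behind that figure under assumed values (an inner-step count of 100 between synchronizations and fp32 gradients as the data-parallel baseline); it is not code from the report, and the helper names are hypothetical.

# Hypothetical illustration of the bandwidth arithmetic: exchanging int8
# pseudo-gradients once every H inner steps (DiLoCo-style) versus an fp32
# gradient all-reduce on every step. H = 100 is an assumed value for
# illustration, not a figure taken from the record above.

NUM_PARAMS = 10_000_000_000   # 10B-parameter model
FP32_BYTES = 4                # bytes per value in standard data parallelism
INT8_BYTES = 1                # bytes per value for an int8 all-reduce
H = 100                       # assumed number of local (inner) steps per sync

def bytes_per_step_data_parallel(num_params: int) -> float:
    """Communication volume of an fp32 gradient exchange on every step."""
    return num_params * FP32_BYTES

def bytes_per_step_diloco_int8(num_params: int, inner_steps: int) -> float:
    """int8 pseudo-gradient exchange amortized over `inner_steps` local steps."""
    return num_params * INT8_BYTES / inner_steps

if __name__ == "__main__":
    dp = bytes_per_step_data_parallel(NUM_PARAMS)
    diloco = bytes_per_step_diloco_int8(NUM_PARAMS, H)
    print(f"data-parallel fp32 : {dp / 1e9:.1f} GB per step")
    print(f"DiLoCo + int8      : {diloco / 1e9:.2f} GB per step (amortized)")
    print(f"reduction factor   : {dp / diloco:.0f}x")

Under these assumptions the amortized payload drops from 40 GB to 0.1 GB per step, i.e. the 4x saving from int8 quantization multiplied by the 100x saving from synchronizing only every 100 inner steps, which is consistent with the 400x figure quoted in the abstract.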