INTELLECT-1 Technical Report
| Published in: | arXiv.org (Dec 2, 2024), p. n/a |
|---|---|
| Main Author: | Jaghouar, Sami |
| Other Authors: | Ong, Jack Min; Basra, Manveer; Obeid, Fares; Straube, Jannik; Keiblinger, Michael; Bakouch, Elie; Atkins, Lucas; Panahi, Maziyar; Goddard, Charles; Ryabinin, Max; Hagemann, Johannes |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Communication; Scale models; Fault tolerance; Nodes |
| Online Access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Tag | I1 | I2 | Content |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3138995643 |
| 003 | | | UK-CbPIL |
| 022 | | | \|a 2331-8422 |
| 035 | | | \|a 3138995643 |
| 045 | 0 | | \|b d20241202 |
| 100 | 1 | | \|a Jaghouar, Sami |
| 245 | 1 | | \|a INTELLECT-1 Technical Report |
| 260 | | | \|b Cornell University Library, arXiv.org \|c Dec 2, 2024 |
| 513 | | | \|a Working Paper |
| 520 | 3 | | \|a In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources. |
| 653 | | | \|a Communication |
| 653 | | | \|a Scale models |
| 653 | | | \|a Fault tolerance |
| 653 | | | \|a Nodes |
| 700 | 1 | | \|a Ong, Jack Min |
| 700 | 1 | | \|a Basra, Manveer |
| 700 | 1 | | \|a Obeid, Fares |
| 700 | 1 | | \|a Straube, Jannik |
| 700 | 1 | | \|a Keiblinger, Michael |
| 700 | 1 | | \|a Bakouch, Elie |
| 700 | 1 | | \|a Atkins, Lucas |
| 700 | 1 | | \|a Panahi, Maziyar |
| 700 | 1 | | \|a Goddard, Charles |
| 700 | 1 | | \|a Ryabinin, Max |
| 700 | 1 | | \|a Hagemann, Johannes |
| 773 | 0 | | \|t arXiv.org \|g (Dec 2, 2024), p. n/a |
| 786 | 0 | | \|d ProQuest \|t Engineering Database |
| 856 | 4 | 1 | \|3 Citation/Abstract \|u https://www.proquest.com/docview/3138995643/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch |
| 856 | 4 | 0 | \|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2412.01152 |
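
The 400x bandwidth-reduction figure in the abstract decomposes into two multiplicative savings: DiLoCo-style training exchanges pseudo-gradients only once every H inner optimizer steps rather than after every step (an ~H-fold saving), and quantizing those pseudo-gradients from 32-bit floats to int8 shrinks each payload a further 4x; with H = 100, for instance, the factors compound to ~400x. The sketch below illustrates the outer synchronization step under those assumptions; the helper names (`quantize_int8`, `outer_step`, `INNER_STEPS`) and the all-gather-based averaging are illustrative stand-ins, not PRIME's actual ElasticDeviceMesh or custom int8 all-reduce kernel.

```python
# Illustrative sketch of a DiLoCo-style outer step with int8-quantized
# pseudo-gradients. All helper names are hypothetical; PRIME's real
# implementation (ElasticDeviceMesh, custom int8 all-reduce) differs.
import torch
import torch.distributed as dist

INNER_STEPS = 100  # assumed H: inner (local) steps between global syncs

def quantize_int8(t: torch.Tensor):
    """Symmetric per-tensor int8 quantization; returns (payload, fp32 scale)."""
    scale = (t.abs().max().clamp(min=1e-12) / 127.0).reshape(1)
    q = (t / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

@torch.no_grad()
def outer_step(global_params, local_params, outer_opt):
    """DiLoCo-style outer update: runs once every INNER_STEPS inner steps."""
    world = dist.get_world_size()
    for gp, lp in zip(global_params, local_params):
        # Pseudo-gradient: how far this worker drifted from the global copy.
        pseudo_grad = gp.data - lp.data
        q, scale = quantize_int8(pseudo_grad)
        # Exchange int8 payloads (plus one fp32 scale each) over the wire.
        q_list = [torch.empty_like(q) for _ in range(world)]
        s_list = [torch.empty_like(scale) for _ in range(world)]
        dist.all_gather(q_list, q)
        dist.all_gather(s_list, scale)
        # Average the dequantized pseudo-gradients from all workers.
        avg = sum(dequantize(qi, si) for qi, si in zip(q_list, s_list)) / world
        gp.grad = avg                   # hand the result to the outer optimizer
    outer_opt.step()                    # e.g. SGD with Nesterov momentum
    outer_opt.zero_grad()
    for gp, lp in zip(global_params, local_params):
        lp.data.copy_(gp.data)          # restart the inner loop from new global
```

A production kernel would reduce the int8 payloads in flight (e.g., a ring all-reduce that requantizes per chunk) rather than gathering every peer's tensor; the all-gather variant is used here only because it keeps the int8 wire format explicit in a few lines.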