INTELLECT-1 Technical Report

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main Author: Jaghouar, Sami
Other Authors: Ong, Jack Min, Basra, Manveer, Obeid, Fares, Straube, Jannik, Keiblinger, Michael, Bakouch, Elie, Atkins, Lucas, Panahi, Maziyar, Goddard, Charles, Ryabinin, Max, Hagemann, Johannes
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3138995643
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3138995643 
045 0 |b d20241202 
100 1 |a Jaghouar, Sami 
245 1 |a INTELLECT-1 Technical Report 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources. 
653 |a Communication 
653 |a Scale models 
653 |a Fault tolerance 
653 |a Nodes 
700 1 |a Ong, Jack Min 
700 1 |a Basra, Manveer 
700 1 |a Obeid, Fares 
700 1 |a Straube, Jannik 
700 1 |a Keiblinger, Michael 
700 1 |a Bakouch, Elie 
700 1 |a Atkins, Lucas 
700 1 |a Panahi, Maziyar 
700 1 |a Goddard, Charles 
700 1 |a Ryabinin, Max 
700 1 |a Hagemann, Johannes 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3138995643/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.01152
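
Note: the abstract in field 520 attributes the reported 400x communication reduction to combining DiLoCo-style infrequent synchronization with a custom int8 all-reduce. The Python sketch below only illustrates the arithmetic behind that figure under assumed values (an inner-step count of 100 between synchronizations and fp32 gradients as the data-parallel baseline); it is not code from the report, and the helper names are hypothetical.

# Hypothetical illustration of the bandwidth arithmetic: exchanging int8
# pseudo-gradients once every H inner steps (DiLoCo-style) versus an fp32
# gradient all-reduce on every step. H = 100 is an assumed value for
# illustration, not a figure taken from the record above.

NUM_PARAMS = 10_000_000_000   # 10B-parameter model
FP32_BYTES = 4                # bytes per value in standard data parallelism
INT8_BYTES = 1                # bytes per value for an int8 all-reduce
H = 100                       # assumed number of local (inner) steps per sync

def bytes_per_step_data_parallel(num_params: int) -> float:
    """Communication volume of an fp32 gradient exchange on every step."""
    return num_params * FP32_BYTES

def bytes_per_step_diloco_int8(num_params: int, inner_steps: int) -> float:
    """int8 pseudo-gradient exchange amortized over `inner_steps` local steps."""
    return num_params * INT8_BYTES / inner_steps

if __name__ == "__main__":
    dp = bytes_per_step_data_parallel(NUM_PARAMS)
    diloco = bytes_per_step_diloco_int8(NUM_PARAMS, H)
    print(f"data-parallel fp32 : {dp / 1e9:.1f} GB per step")
    print(f"DiLoCo + int8      : {diloco / 1e9:.2f} GB per step (amortized)")
    print(f"reduction factor   : {dp / diloco:.0f}x")

Under these assumptions the amortized payload drops from 40 GB to 0.1 GB per step, i.e. the 4x saving from int8 quantization multiplied by the 100x saving from synchronizing only every 100 inner steps, which is consistent with the 400x figure quoted in the abstract.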