INTELLECT-1 Technical Report

Bibliographic details
Published: arXiv.org (Dec 2, 2024)
Lead author: Jaghouar, Sami
Other authors: Ong, Jack Min, Basra, Manveer, Obeid, Fares, Straube, Jannik, Keiblinger, Michael, Bakouch, Elie, Atkins, Lucas, Panahi, Maziyar, Goddard, Charles, Ryabinin, Max, Hagemann, Johannes
Publisher: Cornell University Library, arXiv.org
Abstract: In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources.
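The communication savings described in the abstract come from two stacked ideas: DiLoCo-style training synchronizes nodes only at infrequent "outer" steps, and the quantity exchanged at those steps (a pseudo-gradient) is compressed to int8 before it crosses the internet. The sketch below illustrates that pattern with PyTorch's torch.distributed. It is a minimal illustrative sketch under assumptions of my own: the function names, the symmetric per-tensor quantization scheme, and the all_gather-then-average collective are stand-ins, not PRIME's actual int8 all-reduce kernel or its DiLoCo-FSDP2 integration.

```python
# Minimal sketch (illustrative assumptions, not PRIME's implementation):
# a DiLoCo-style outer synchronization step where each rank's pseudo-gradient
# is quantized to int8 before being communicated, shrinking the payload ~4x
# versus fp32 on top of DiLoCo's infrequent synchronization.
import torch
import torch.distributed as dist


def int8_quantize(t: torch.Tensor):
    """Symmetric per-tensor int8 quantization; returns (int8 values, fp32 scale)."""
    scale = (t.abs().max().clamp(min=1e-8) / 127.0).reshape(1)
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale


def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale


def allreduce_mean_int8(t: torch.Tensor) -> torch.Tensor:
    """Average a tensor across ranks while only shipping int8 payloads:
    all_gather the quantized tensors and their scales, then dequantize and
    average locally (a simple stand-in for a fused int8 all-reduce)."""
    world = dist.get_world_size()
    q, scale = int8_quantize(t)
    q_buf = [torch.empty_like(q) for _ in range(world)]
    s_buf = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(q_buf, q)
    dist.all_gather(s_buf, scale)
    return torch.stack(
        [int8_dequantize(qi, si) for qi, si in zip(q_buf, s_buf)]
    ).mean(dim=0)


@torch.no_grad()
def diloco_outer_step(global_params, local_params, outer_opt):
    """Run once every H inner steps: form pseudo-gradient = global - local,
    average it across nodes with the int8 collective, and apply the outer
    optimizer to the global weights."""
    for g, p in zip(global_params, local_params):
        g.grad = allreduce_mean_int8(g.detach() - p.detach())
    outer_opt.step()
    outer_opt.zero_grad()
    # Each node resumes inner training from the updated global weights.
    for g, p in zip(global_params, local_params):
        p.copy_(g)
```

Because the outer step runs only once per many inner steps and ships int8 rather than fp32 data, inter-node traffic is reduced by orders of magnitude relative to per-step data-parallel gradient synchronization, which is the regime the abstract's 400x figure refers to.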
ISSN: 2331-8422
Resource: Engineering Database