AI Computing Systems for Large Language Models Training

Bibliographic Details
Published in: Journal of Computer Science and Technology vol. 40, no. 1 (Jan 2025), p. 6
Published: Springer Nature B.V.
Subjects: Large language models; Artificial intelligence; Graphics processing units; Hardware; Tensors; Algorithms; Encoders-Decoders; Complexity; Machine learning; Infrastructure; Distributed memory; Software; Accelerators; Distributed processing; Memory management
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3176454874
003 UK-CbPIL
022 |a 1000-9000 
022 |a 1860-4749 
024 7 |a 10.1007/s11390-024-4178-1  |2 doi 
035 |a 3176454874 
045 2 |b d20250101  |b d20250131 
084 |a 137755  |2 nlm 
245 1 |a AI Computing Systems for Large Language Models Training 
260 |b Springer Nature B.V.  |c Jan 2025 
513 |a Journal Article 
520 3 |a In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language models (LLMs) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators like GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We delve into the execution and scheduling of LLM algorithms, underlining the critical role of distributed computing strategies, memory management enhancements, and techniques for boosting computational efficiency. This paper clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLMs training, offering insights into the challenges and potential avenues for future development and deployment. 
653 |a Large language models 
653 |a Artificial intelligence 
653 |a Graphics processing units 
653 |a Hardware 
653 |a Tensors 
653 |a Algorithms 
653 |a Encoders-Decoders 
653 |a Complexity 
653 |a Machine learning 
653 |a Infrastructure 
653 |a Distributed memory 
653 |a Software 
653 |a Accelerators 
653 |a Distributed processing 
653 |a Memory management 
773 0 |t Journal of Computer Science and Technology  |g vol. 40, no. 1 (Jan 2025), p. 6 
786 0 |d ProQuest  |t ABI/INFORM Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3176454874/abstract/embedded/Y2VX53961LHR7RE6?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3176454874/fulltextPDF/embedded/Y2VX53961LHR7RE6?source=fedsrch