Optimizing Ring AllReduce for Sparse Data

Saved in:
Bibliographic Details
Published in: ProQuest Dissertations and Theses (2026)
Main author: Arunachalam, Anshul
Published: ProQuest Dissertations & Theses
Subjects:
Online access: Citation/Abstract
Full Text - PDF
Description
Abstract: The distributed training of machine learning models via gradient descent is generally conducted by iteratively computing the local gradients of a loss function and aggregating them across all processors. Communicating these gradients during aggregation is often a major cost, but sparsification techniques can greatly improve efficiency. One such technique, Top-k gradient compression, ensures that only the k largest components of each local gradient are sent. However, effectively scaling this method can be challenging. The standard ring AllReduce algorithm, which is frequently used to aggregate dense gradients, lacks a counterpart that is optimized for sparse data. Notably, ring algorithms are contention-free, which generally makes them easier to scale than other collective communication algorithms. Thus, in practice, the ring AllGather algorithm, which can be trivially adapted for sparse data, may be used instead, even though its bandwidth costs are proportional to the number of utilized processors (unlike ring AllReduce). To provide a more scalable contention-free alternative, we present a variant of ring AllReduce that has been better optimized for sparse data. We compare it to the standard dense ring AllReduce and ring AllGather algorithms, and we evaluate it empirically using gradients sampled from fine-tuning Llama 2 7b.
ISBN: 9798270248345
Source: Publicly Available Content Database
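
To illustrate the trade-off described in the abstract above, here is a minimal Python sketch (illustrative only, with assumed function and variable names; it is not the dissertation's algorithm). It compresses each local gradient to its k largest-magnitude entries and then aggregates the compressed gradients the way a sparse ring AllGather would, i.e., every processor receives all p sparse gradients and sums them locally. The per-processor receive volumes printed at the end show why gather-based aggregation grows with the number of processors p, while a dense ring AllReduce stays independent of p.

import numpy as np

def top_k_compress(grad, k):
    """Keep only the k largest-magnitude components of a flattened local gradient."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the top-k entries
    return idx, flat[idx]                          # sparse (indices, values) pair

def aggregate_via_allgather(sparse_grads, dim):
    """Sum Top-k gradients as a sparse ring AllGather would: each processor
    ends up holding all p sparse gradients, so its received data grows with
    p * k rather than staying independent of p."""
    total = np.zeros(dim)
    for idx, vals in sparse_grads:
        np.add.at(total, idx, vals)                # scatter-add, handles repeated indices
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, dim, k = 4, 1000, 50                        # processors, gradient size, Top-k budget
    local_grads = [rng.standard_normal(dim) for _ in range(p)]
    sparse = [top_k_compress(g, k) for g in local_grads]
    aggregated = aggregate_via_allgather(sparse, dim)
    # Rough per-processor receive volume (values only, ignoring index overhead):
    print("dense ring AllReduce  ~", dim)          # independent of p
    print("sparse ring AllGather ~", p * k)        # proportional to p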