Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

Bibliographic Details
Published in: arXiv.org (Dec 24, 2024)
Main Author: Shen, Qianli
Other Authors: Wang, Yezhen; Yang, Zhouhao; Li, Xiang; Wang, Haonan; Zhang, Yang; Scarlett, Jonathan; Zhu, Zhanxing; Kawaguchi, Kenji
Published: Cornell University Library, arXiv.org
Subjects: Approximation; Algorithms; Deep learning; System effectiveness; Machine learning; Optimization; Distributed processing
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3070859081
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3070859081 
045 0 |b d20241224 
100 1 |a Shen, Qianli 
245 1 |a Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization 
260 |b Cornell University Library, arXiv.org  |c Dec 24, 2024 
513 |a Working Paper 
520 3 |a Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce \(\textbf{F}\)orward \(\textbf{G}\)radient \(\textbf{U}\)nrolling with \(\textbf{F}\)orward \(\textbf{G}\)radient, abbreviated as \((\textbf{FG})^2\textbf{U}\), which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. \((\text{FG})^2\text{U}\) circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, \((\text{FG})^2\text{U}\) is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, \((\text{FG})^2\text{U}\) and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, \((\text{FG})^2\text{U}\) is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for \((\text{FG})^2\text{U}\), complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks. Code is available at https://github.com/ShenQianli/FG2U. 
653 |a Approximation 
653 |a Algorithms 
653 |a Deep learning 
653 |a System effectiveness 
653 |a Machine learning 
653 |a Optimization 
653 |a Distributed processing 
700 1 |a Wang, Yezhen 
700 1 |a Yang, Zhouhao 
700 1 |a Li, Xiang 
700 1 |a Wang, Haonan 
700 1 |a Zhang, Yang 
700 1 |a Scarlett, Jonathan 
700 1 |a Zhu, Zhanxing 
700 1 |a Kawaguchi, Kenji 
773 0 |t arXiv.org  |g (Dec 24, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3070859081/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2406.14095
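
The abstract (MARC 520 above) describes the core idea of (FG)²U: estimate the meta-gradient of an unrolled inner optimization with forward-mode directional derivatives along random tangent vectors, avoiding the reverse-mode tape of the whole unroll and parallelizing across tangents. Below is a minimal sketch of that idea in JAX on a toy weighted-regression bi-level problem; all names (inner_loss, unroll, fgu_estimate, the Gaussian tangent distribution, step counts) are illustrative assumptions, not the paper's released implementation at https://github.com/ShenQianli/FG2U.

```python
import jax
import jax.numpy as jnp

# Toy bi-level problem: the inner loop fits weights w under
# per-example weights lam (the outer variables); the outer loss
# evaluates the unrolled solution on validation data. All names
# here are hypothetical, chosen for illustration only.

def inner_loss(w, lam, x, y):
    return jnp.mean(lam * (x @ w - y) ** 2)

def outer_loss(w, x_val, y_val):
    return jnp.mean((x_val @ w - y_val) ** 2)

def unroll(lam, w0, x, y, lr=0.1, steps=20):
    # Unrolled inner optimization: the final iterate is a
    # differentiable function of the outer variables lam.
    w = w0
    for _ in range(steps):
        w = w - lr * jax.grad(inner_loss)(w, lam, x, y)
    return w

def meta_objective(lam, w0, x, y, x_val, y_val):
    return outer_loss(unroll(lam, w0, x, y), x_val, y_val)

def fgu_estimate(key, lam, args, num_tangents=8):
    # Forward-gradient estimator: for tangents v ~ N(0, I),
    # E[(grad F . v) v] = grad F, so averaging the directional
    # derivative (a forward-mode JVP through the whole unroll)
    # times v over a few tangents gives an unbiased estimate of
    # the meta-gradient without storing the reverse-mode tape.
    f = lambda l: meta_objective(l, *args)

    def one_tangent(k):
        v = jax.random.normal(k, lam.shape)
        _, df_v = jax.jvp(f, (lam,), (v,))  # scalar grad F . v
        return df_v * v

    keys = jax.random.split(key, num_tangents)
    return jnp.mean(jax.vmap(one_tangent)(keys), axis=0)
```

In an outer loop one would then update the outer variables with the estimate, e.g. lam = lam - outer_lr * fgu_estimate(key, lam, (w0, x, y, x_val, y_val)). Each tangent is independent, so the num_tangents directional derivatives can be computed on separate devices, which is the kind of parallelism the abstract refers to.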