Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization
| Published in: | arXiv.org (Dec 24, 2024), p. n/a |
|---|---|
| Main author: | Shen, Qianli |
| Other authors: | Wang, Yezhen; Yang, Zhouhao; Li, Xiang; Wang, Haonan; Zhang, Yang; Scarlett, Jonathan; Zhu, Zhanxing; Kawaguchi, Kenji |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Approximation; Algorithms; Deep learning; System effectiveness; Machine learning; Optimization; Distributed processing |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3070859081 | ||
| 003 | UK-CbPIL | ||
| 022 | |a 2331-8422 | ||
| 035 | |a 3070859081 | ||
| 045 | 0 | |b d20241224 | |
| 100 | 1 | |a Shen, Qianli | |
| 245 | 1 | |a Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization | |
| 260 | |b Cornell University Library, arXiv.org |c Dec 24, 2024 | ||
| 513 | |a Working Paper | ||
| 520 | 3 | |a Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce \(\textbf{F}\)orward \(\textbf{G}\)radient \(\textbf{U}\)nrolling with \(\textbf{F}\)orward \(\textbf{G}\)radient, abbreviated as \((\textbf{FG})^2\textbf{U}\), which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. \((\text{FG})^2\text{U}\) circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, \((\text{FG})^2\text{U}\) is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, \((\text{FG})^2\text{U}\) and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, \((\text{FG})^2\text{U}\) is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for \((\text{FG})^2\text{U}\), complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks. Code is available at https://github.com/ShenQianli/FG2U. | |
| 653 | |a Approximation | ||
| 653 | |a Algorithms | ||
| 653 | |a Deep learning | ||
| 653 | |a System effectiveness | ||
| 653 | |a Machine learning | ||
| 653 | |a Optimization | ||
| 653 | |a Distributed processing | ||
| 700 | 1 | |a Wang, Yezhen | |
| 700 | 1 | |a Yang, Zhouhao | |
| 700 | 1 | |a Li, Xiang | |
| 700 | 1 | |a Wang, Haonan | |
| 700 | 1 | |a Zhang, Yang | |
| 700 | 1 | |a Scarlett, Jonathan | |
| 700 | 1 | |a Zhu, Zhanxing | |
| 700 | 1 | |a Kawaguchi, Kenji | |
| 773 | 0 | |t arXiv.org |g (Dec 24, 2024), p. n/a | |
| 786 | 0 | |d ProQuest |t Engineering Database | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3070859081/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | |3 Full text outside of ProQuest |u http://arxiv.org/abs/2406.14095 |
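
The 520 abstract describes the core mechanism of \((\text{FG})^2\text{U}\): estimating the meta gradient of an unrolled inner optimization via forward-mode directional derivatives. Below is a minimal JAX sketch of that idea on a toy quadratic bi-level problem. All names and hyperparameters here (`inner_loss`, `unroll`, `fg2u_estimate`, `num_dirs`, etc.) are illustrative assumptions for this sketch, not taken from the authors' released code at the repository linked in the abstract.

```python
import jax
import jax.numpy as jnp

# Toy quadratic bi-level problem; objectives are assumptions for
# illustration only, not the paper's experimental setup.

def inner_loss(theta, lam, x):
    # Lower-level objective f(theta, lam): fit x @ theta to targets lam.
    return jnp.sum((x @ theta - lam) ** 2)

def outer_loss(theta, x_val):
    # Upper-level objective F(theta), evaluated on held-out data.
    return jnp.sum((x_val @ theta) ** 2)

def unroll(lam, theta0, x, num_steps=10, lr=0.1):
    # Gradient unrolling: run the inner optimizer for a fixed number of
    # steps so that theta becomes a differentiable function of lam.
    theta = theta0
    for _ in range(num_steps):
        theta = theta - lr * jax.grad(inner_loss)(theta, lam, x)
    return theta

def meta_objective(lam, theta0, x, x_val):
    return outer_loss(unroll(lam, theta0, x), x_val)

def fg2u_estimate(key, lam, theta0, x, x_val, num_dirs=4):
    # Forward-gradient estimate of d meta_objective / d lam.
    # For each random tangent v ~ N(0, I), a single jax.jvp pass through
    # the unrolled loop yields the directional derivative (grad . v);
    # (grad . v) * v is an unbiased gradient estimate since E[v v^T] = I,
    # and averaging over directions reduces its variance.
    def one_direction(k):
        v = jax.random.normal(k, lam.shape)
        _, dir_deriv = jax.jvp(
            lambda l: meta_objective(l, theta0, x, x_val), (lam,), (v,)
        )
        return dir_deriv * v

    keys = jax.random.split(key, num_dirs)
    # The directions are independent, so they map naturally onto
    # parallel/distributed evaluation; vmap stands in for that here.
    return jnp.mean(jax.vmap(one_direction)(keys), axis=0)

# Example usage on random data (shapes are arbitrary):
key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
x = jax.random.normal(k1, (32, 8))
x_val = jax.random.normal(k2, (32, 8))
grad_est = fg2u_estimate(key, jnp.ones(32), jnp.zeros(8), x, x_val)
```

This sketch also reflects the two properties the abstract claims: forward-mode JVP needs no reverse-mode tape over the unrolled trajectory, so memory stays constant in the number of inner steps, and the independent random directions parallelize across devices.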