UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices
| Published in: | arXiv.org (Dec 3, 2024), p. n/a |
|---|---|
| Main author: | Seul-Ki Yeom |
| Other authors: | Tae-Ho Kim |
| Published: | Cornell University Library, arXiv.org |
| Topics: | Accuracy; Attention; Memory tasks; Memory devices; Flash memory (computers); Platforms; Real time; Task complexity; Inference |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| LEADER | 00000nab a2200000uu 4500 | | |
|---|---|---|---|
| 001 | | | 3140661897 |
| 003 | | | UK-CbPIL |
| 022 | | | |a 2331-8422 |
| 035 | | | |a 3140661897 |
| 045 | 0 | | |b d20241203 |
| 100 | 1 | | |a Seul-Ki Yeom |
| 245 | 1 | | |a UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices |
| 260 | | | |b Cornell University Library, arXiv.org |c Dec 3, 2024 |
| 513 | | | |a Working Paper |
| 520 | 3 | | |a Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging due to high memory and computational demands. In this paper, we introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization, enabling seamless operation on resource-constrained platforms without compromising performance. Unlike traditional multi-head attention (MHA), which redundantly computes separate attention matrices for each head, Reuse Attention consolidates these computations into a shared attention matrix, significantly reducing memory overhead and computational complexity. Comprehensive experiments on ImageNet-1K and downstream tasks show that the proposed UniForm models leveraging Reuse Attention achieve state-of-the-art ImageNet classification accuracy while outperforming existing attention mechanisms, such as Linear Attention and Flash Attention, in inference speed and memory scalability. Notably, UniForm-l achieves 76.7% Top-1 accuracy on ImageNet-1K with a 21.8 ms inference time on edge devices like the Jetson AGX Orin, representing up to a 5x speedup over competing benchmark methods. These results demonstrate the versatility of Reuse Attention across high-performance GPUs and edge platforms, paving the way for broader real-time applications. |
| 653 | | | |a Accuracy |
| 653 | | | |a Attention |
| 653 | | | |a Memory tasks |
| 653 | | | |a Memory devices |
| 653 | | | |a Flash memory (computers) |
| 653 | | | |a Platforms |
| 653 | | | |a Real time |
| 653 | | | |a Task complexity |
| 653 | | | |a Inference |
| 700 | 1 | | |a Kim, Tae-Ho |
| 773 | 0 | | |t arXiv.org |g (Dec 3, 2024), p. n/a |
| 786 | 0 | | |d ProQuest |t Engineering Database |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3140661897/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | |3 Full text outside of ProQuest |u http://arxiv.org/abs/2412.02344 |
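
The 520 abstract above contrasts standard multi-head attention, where every head computes its own attention matrix, with the paper's Reuse Attention, which consolidates the heads onto a shared attention matrix. The NumPy sketch below only illustrates that contrast under stated assumptions; the function names, projection shapes, and the way the shared matrix is formed are illustrative choices, not the UniForm implementation (the full method is in the arXiv paper linked in the 856 field).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv):
    """Standard MHA: each head builds its own (N, N) attention matrix."""
    heads = []
    for q_w, k_w, v_w in zip(Wq, Wk, Wv):               # one weight triple per head
        q, k, v = x @ q_w, x @ k_w, x @ v_w              # (N, d) each
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # per-head attention matrix
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)                # (N, H*d)

def reuse_attention(x, Wq, Wk, Wv):
    """Sketch of the reuse idea: one shared attention matrix applied to every head's values.

    Assumption for illustration: a single query/key projection feeds the shared matrix,
    while only the value path stays per-head.
    """
    q, k = x @ Wq, x @ Wk                                # single (N, d) query/key projection
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # computed once, reused by all heads
    heads = [attn @ (x @ v_w) for v_w in Wv]             # per-head values share the same attn
    return np.concatenate(heads, axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D, H = 16, 64, 4                                  # tokens, embedding dim, heads
    d = D // H
    x = rng.standard_normal((N, D))
    Wq = rng.standard_normal((H, D, d))
    Wk = rng.standard_normal((H, D, d))
    Wv = rng.standard_normal((H, D, d))
    print(multi_head_attention(x, Wq, Wk, Wv).shape)     # (16, 64), H attention matrices built
    print(reuse_attention(x, Wq[0], Wk[0], Wv).shape)    # (16, 64), a single attention matrix built
```

In this sketch, standard MHA materializes H separate (N, N) matrices while the reuse variant materializes one, so the attention-matrix memory drops by roughly a factor of H; that is the kind of memory-scalability gain the abstract attributes to Reuse Attention.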