Towards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers

Bibliographic Record Details
Published in: arXiv.org (Dec 7, 2024), p. n/a
Main author: Xu, Boxun
Other authors: Hwang, Junyoung; Vanna-iampikul, Pruek; Yin, Yuxuan; Lim, Sung Kyu; Li, Peng
Published: Cornell University Library, arXiv.org
Subjects: Brain; Circuits; Neural networks; Mixtures; Hardware; Spiking; Distributed processing; Logic; Network latency
Available Online: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3142734164
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3142734164 
045 0 |b d20241207 
100 1 |a Xu, Boxun 
245 1 |a Towards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers 
260 |b Cornell University Library, arXiv.org  |c Dec 7, 2024 
513 |a Working Paper 
520 3 |a Spiking Neural Networks (SNNs) provide a brain-inspired, event-driven mechanism that is believed to be critical to unlocking energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems, introducing conditional computation policies and expanding model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and dependencies between visual or linguistic tokens. However, there is currently a lack of hardware support for the highly parallel distributed processing needed by spiking transformers, which embody brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking transformers. By leveraging 3D integration with memory-on-logic and logic-on-logic stacking, we explore such brain-inspired accelerators with spatially stackable circuitry, demonstrating significant improvements in energy efficiency and latency over conventional 2D CMOS integration. 
653 |a Brain 
653 |a Circuits 
653 |a Neural networks 
653 |a Mixtures 
653 |a Hardware 
653 |a Spiking 
653 |a Distributed processing 
653 |a Logic 
653 |a Network latency 
700 1 |a Hwang, Junyoung 
700 1 |a Vanna-iampikul, Pruek 
700 1 |a Yin, Yuxuan 
700 1 |a Lim, Sung Kyu 
700 1 |a Li, Peng 
773 0 |t arXiv.org  |g (Dec 7, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3142734164/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.05540