Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV 1.0 Compliant Open-Source Processor

Bibliographic details
Published in: arXiv.org (Jun 17, 2024), p. n/a
Main author: Perotti, Matteo
Other authors: Cavalcante, Matheus; Andri, Renzo; Cavigelli, Lukas; Benini, Luca
Published: Cornell University Library, arXiv.org
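Subjects: Energy efficiency; Vector processing (computers); Kernels; Performance evaluation; RISC; Computer architecture; Array processors; Critical path; Microprocessors; Configurations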
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2889794635
003 UK-CbPIL
022 |a 2331-8422 
024 7 |a 10.1109/TC.2024.3388896  |2 doi 
035 |a 2889794635 
045 0 |b d20240617 
100 1 |a Perotti, Matteo 
245 1 |a Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV 1.0 Compliant Open-Source Processor 
260 |b Cornell University Library, arXiv.org  |c Jun 17, 2024 
513 |a Working Paper 
520 3 |a Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ~40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency. 
653 |a Energy efficiency 
653 |a Vector processing (computers) 
653 |a Kernels 
653 |a Performance evaluation 
653 |a RISC 
653 |a Computer architecture 
653 |a Array processors 
653 |a Critical path 
653 |a Microprocessors 
653 |a Configurations 
700 1 |a Cavalcante, Matheus 
700 1 |a Andri, Renzo 
700 1 |a Cavigelli, Lukas 
700 1 |a Benini, Luca 
773 0 |t arXiv.org  |g (Jun 17, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2889794635/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2311.07493