An Efficient Convolutional Neural Network Accelerator Design on FPGA Using the Layer-to-Layer Unified Input Winograd Architecture

Bibliographic Details
Published in: Electronics vol. 14, no. 6 (2025), p. 1182
Main Author: Li, Jie
Other Authors: Liang, Yong; Yang, Zhenhao; Li, Xinhai
Published: MDPI AG
Description
Abstract: Convolutional Neural Networks (CNNs) have found widespread applications in artificial intelligence fields such as computer vision and edge computing. However, as input data dimensionality and convolutional model depth continue to increase, deploying CNNs on edge and embedded devices faces significant challenges, including high computational demands, excessive hardware resource consumption, and prolonged computation times. In contrast, the Decomposable Winograd Method (DWM), which decomposes large-size or large-stride kernels into smaller kernels, provides a more efficient solution for inference acceleration in resource-constrained environments. This work proposes an approach employing the layer-to-layer unified input transformation based on the Decomposable Winograd Method. This reduces computational complexity in the feature transformation unit through system-level parallel pipelining and operation reuse. Additionally, we introduce a reconfigurable, column-indexed Winograd computation unit design to minimize hardware resource consumption. We also design flexible data access patterns to support efficient computation. Finally, we propose a preprocessing shift network system that enables low-latency data access and dynamic selection of the Winograd computation unit. Experimental evaluations on VGG-16 and ResNet-18 networks demonstrate that our accelerator, deployed on the Xilinx XC7Z045 platform, achieves an average throughput of 683.26 GOPS. Compared to existing approaches, the design improves DSP efficiency (GOPS/DSPs) by 5.8×.
ISSN:2079-9292
DOI:10.3390/electronics14061182
Source: Advanced Technologies & Aerospace Database
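
As background for the Winograd terminology in the abstract, the sketch below shows the standard 1-D Winograd minimal filtering algorithm F(2,3) (two outputs of a 3-tap convolution using four multiplications instead of six), which is the textbook building block that Winograd-based CNN accelerators elaborate on. It is illustrative only and does not reproduce the paper's layer-to-layer unified input transformation, DWM kernel decomposition, or hardware design; the function name `winograd_f23` and the sample values are assumptions chosen for the example.

```python
# Minimal sketch of the standard 1-D Winograd minimal filtering algorithm F(2, 3):
# two outputs of a 3-tap "convolution" (CNN-style correlation) computed with
# 4 multiplications instead of 6. This is NOT the paper's accelerator design.

def winograd_f23(d, g):
    """Compute two outputs of a valid 1-D convolution of a 4-element input tile
    d = [d0, d1, d2, d3] with a 3-element kernel g = [g0, g1, g2]."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Input transform B^T d (additions/subtractions only)
    m_in = [d0 - d2, d1 + d2, d2 - d1, d1 - d3]

    # Kernel transform G g (can be precomputed once per kernel)
    m_k = [g0, (g0 + g1 + g2) / 2.0, (g0 - g1 + g2) / 2.0, g2]

    # Element-wise products: the 4 multiplications
    m = [m_in[i] * m_k[i] for i in range(4)]

    # Output transform A^T m
    y0 = m[0] + m[1] + m[2]
    y1 = m[1] - m[2] - m[3]
    return [y0, y1]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [1.0, 0.5, -1.0]
    # Direct computation for reference: y[i] = sum_k d[i + k] * g[k]
    direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
    print(winograd_f23(d, g), direct)  # both print [-1.0, -0.5]
```

In 2-D CNN layers this 1-D form is nested to give tile-wise transforms of the feature map and kernel; reusing the input transform across layers and decomposing larger or strided kernels into such small units is, per the abstract, where the paper's contribution lies.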