Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators

Bibliographic Information
Published in: Information vol. 16, no. 4 (2025), p. 298
Main author: Fang, Tianyang
Other authors: Perez-Vicente, Alejandro; Johnson, Hans; Saniie, Jafar
Published: MDPI AG
Subjects:
Links: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3194615459
003 UK-CbPIL
022 |a 2078-2489 
024 7 |a 10.3390/info16040298  |2 doi 
035 |a 3194615459 
045 2 |b d20250101  |b d20251231 
084 |a 231474  |2 nlm 
100 1 |a Fang, Tianyang 
245 1 |a Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a This paper presents the development and evaluation of a distributed system employing low-latency embedded field-programmable gate arrays (FPGAs) to optimize scheduling for deep learning (DL) workloads and to configure multiple deep learning accelerator (DLA) architectures. Aimed at advancing FPGA applications in real-time edge computing, this study focuses on achieving optimal latency for a distributed computing system. A novel methodology was adopted, using configurable hardware to examine clusters of DLAs that vary in architecture and scheduling technique. The system demonstrated its capability to process diverse neural network (NN) models in parallel, manage compute graphs in a pipelined sequence, and allocate computational resources efficiently to intensive NN layers. We examined five configurable DLAs—Versatile Tensor Accelerator (VTA), Nvidia DLA (NVDLA), Xilinx Deep Processing Unit (DPU), Tensil Compute Unit (CU), and Pipelined Convolutional Neural Network (PipeCNN)—across two FPGA cluster types consisting of Zynq-7000 and Zynq UltraScale+ System-on-Chip (SoC) processors, respectively. Four scheduling methods for deep neural network (DNN) workloads were tested: Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling. These methods revealed an exponential decay in processing time, with speedups of up to 90%, although deviations were noted depending on the workload and cluster configuration. This research substantiates the utility of FPGAs in adaptable, efficient DL deployment, setting a precedent for future experimental configurations and performance benchmarks. 
610 4 |a Xilinx Inc 
653 |a Software 
653 |a Deep learning 
653 |a Configuration management 
653 |a Artificial neural networks 
653 |a Edge computing 
653 |a Field programmable gate arrays 
653 |a Clusters 
653 |a Automation 
653 |a Machine learning 
653 |a Workloads 
653 |a System on chip 
653 |a Distributed processing 
653 |a Efficiency 
653 |a Scheduling 
653 |a Pipelining (computers) 
653 |a Neural networks 
653 |a Optimization 
653 |a Tensors 
653 |a Network latency 
653 |a Workload 
653 |a Design 
653 |a Real time 
653 |a Computer networks 
700 1 |a Perez-Vicente, Alejandro 
700 1 |a Johnson, Hans 
700 1 |a Saniie, Jafar 
773 0 |t Information  |g vol. 16, no. 4 (2025), p. 298 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3194615459/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3194615459/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3194615459/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch