Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators

Bibliographic Information
Published in: Information vol. 16, no. 4 (2025), p. 298
Main author: Fang, Tianyang
Other authors: Perez-Vicente, Alejandro; Johnson, Hans; Saniie, Jafar
Published: MDPI AG
Subjects:
Links: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3194615459
003 UK-CbPIL
022 |a 2078-2489 
024 7 |a 10.3390/info16040298  |2 doi 
035 |a 3194615459 
045 2 |b d20250101  |b d20251231 
084 |a 231474  |2 nlm 
100 1 |a Fang, Tianyang 
245 1 |a Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a This paper presents the development and evaluation of a distributed system employing low-latency embedded field-programmable gate arrays (FPGAs) to optimize scheduling for deep learning (DL) workloads and to configure multiple deep learning accelerator (DLA) architectures. Aimed at advancing FPGA applications in real-time edge computing, this study focuses on achieving optimal latency for a distributed computing system. A novel methodology was adopted, using configurable hardware to examine clusters of DLAs that vary in architecture and scheduling technique. The system demonstrated its capability to process diverse neural network (NN) models in parallel, manage compute graphs in a pipelined sequence, and allocate computational resources efficiently to intensive NN layers. We examined five configurable DLAs—Versatile Tensor Accelerator (VTA), Nvidia DLA (NVDLA), Xilinx Deep Processing Unit (DPU), Tensil Compute Unit (CU), and Pipelined Convolutional Neural Network (PipeCNN)—across two FPGA cluster types consisting of Zynq-7000 and Zynq UltraScale+ System-on-Chip (SoC) processors, respectively. Four scheduling methods for deep neural network (DNN) workloads were tested: Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling. These methods revealed an exponential decay in processing time, with speedups of up to 90%, although deviations were noted depending on the workload and cluster configuration. This research substantiates the utility of FPGAs in adaptable, efficient DL deployment, setting a precedent for future experimental configurations and performance benchmarks. 
610 4 |a Xilinx Inc 
653 |a Software 
653 |a Deep learning 
653 |a Configuration management 
653 |a Artificial neural networks 
653 |a Edge computing 
653 |a Field programmable gate arrays 
653 |a Clusters 
653 |a Automation 
653 |a Machine learning 
653 |a Workloads 
653 |a System on chip 
653 |a Distributed processing 
653 |a Efficiency 
653 |a Scheduling 
653 |a Pipelining (computers) 
653 |a Neural networks 
653 |a Optimization 
653 |a Tensors 
653 |a Network latency 
653 |a Workload 
653 |a Design 
653 |a Real time 
653 |a Computer networks 
700 1 |a Perez-Vicente, Alejandro 
700 1 |a Johnson, Hans 
700 1 |a Saniie, Jafar 
773 0 |t Information  |g vol. 16, no. 4 (2025), p. 298 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3194615459/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3194615459/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3194615459/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch