Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures

Bibliographic Details
Published in: arXiv.org (Dec 9, 2024), p. n/a
Main Author: Kemmler, Samuel
Other Authors: Rettinger, Christoph, Rüde, Ulrich, Cuéllar, Pablo, Köstler, Harald
Published:
Cornell University Library, arXiv.org
Subjects: Simulation; Fluidized beds; Central processing units--CPUs; Fluid dynamics; Graphics processing units; Hardware; Supercomputers; Run time (computers)
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2789557366
003 UK-CbPIL
022 |a 2331-8422 
035 |a 2789557366 
045 0 |b d20241209 
100 1 |a Kemmler, Samuel 
245 1 |a Efficiency and scalability of fully-resolved fluid-particle simulations on heterogeneous CPU-GPU architectures 
260 |b Cornell University Library, arXiv.org  |c Dec 9, 2024 
513 |a Working Paper 
520 3 |a Current supercomputers often have a heterogeneous architecture using both CPUs and GPUs. At the same time, numerical simulation tasks frequently involve multiphysics scenarios whose components run on different hardware for a variety of reasons, such as architectural requirements or pragmatic implementation choices. This leads naturally to a software design where different simulation modules are mapped to different subsystems of the heterogeneous architecture. We present a detailed performance analysis for such a hybrid four-way coupled simulation of a fully resolved particle-laden flow. The Eulerian representation of the flow utilizes GPUs, while the Lagrangian model for the particles runs on CPUs. First, a roofline model is employed to predict the node-level performance and to show that the lattice-Boltzmann-based fluid simulation reaches very good performance on a single GPU. Furthermore, the GPU-GPU communication for a large-scale flow simulation results in only moderate slowdowns, thanks to the efficiency of CUDA-aware MPI communication combined with communication-hiding techniques. On 1024 A100 GPUs, a parallel efficiency of up to 71% is achieved. While the flow simulation has good performance characteristics, the integration of the stiff Lagrangian particle system requires frequent CPU-CPU communication that can become a bottleneck. Additionally, special attention is paid to the CPU-GPU communication overhead, since this communication is essential for coupling the particles to the flow simulation. However, thanks to our problem-aware co-partitioning, the CPU-GPU communication overhead is found to be negligible. As a lesson learned from this development, four criteria are postulated that a hybrid implementation must meet for the efficient use of heterogeneous supercomputers. Additionally, an a priori estimate of the speedup for hybrid implementations is suggested. 
653 |a Simulation 
653 |a Fluidized beds 
653 |a Central processing units--CPUs 
653 |a Fluid dynamics 
653 |a Graphics processing units 
653 |a Hardware 
653 |a Supercomputers 
653 |a Run time (computers) 
700 1 |a Rettinger, Christoph 
700 1 |a Rüde, Ulrich 
700 1 |a Cuéllar, Pablo 
700 1 |a Köstler, Harald 
773 0 |t arXiv.org  |g (Dec 9, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2789557366/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2303.11811
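
Editorial note: the abstract above refers to a roofline performance bound and to a parallel efficiency of about 71% on 1024 A100 GPUs. The short Python sketch below is an illustration of these standard formulas only; it is not taken from the paper or its code, and the hardware figures used are nominal A100 specifications assumed purely for the example.

# Illustrative sketch of the roofline bound and weak-scaling parallel efficiency
# mentioned in the abstract. Not the authors' implementation; the A100 numbers
# below are nominal vendor specifications, assumed here for illustration only.

def roofline_bound(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Attainable performance (FLOP/s): min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)

def weak_scaling_efficiency(t_reference, t_scaled):
    """Weak-scaling parallel efficiency: time of the reference run / time of the scaled-out run."""
    return t_reference / t_scaled

# A memory-bound lattice Boltzmann kernel with low arithmetic intensity is limited
# by memory bandwidth rather than by the compute peak
# (assumed here: ~9.7 TFLOP/s FP64, ~1.555 TB/s HBM bandwidth).
print(roofline_bound(9.7e12, 1.555e12, 1.0))   # -> 1.555e12 FLOP/s, bandwidth-bound

# A run whose time per step grows from 1.00 s to 1.41 s when scaled out corresponds
# to roughly 71% parallel efficiency, the figure quoted in the abstract.
print(weak_scaling_efficiency(1.00, 1.41))     # -> ~0.709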