Method for portable, scalable, and performant GPU-accelerated simulation of multiphase compressible flow

Saved in:
Bibliographic Details
Published in: arXiv.org (Feb 3, 2024), p. n/a
Main Author: Radhakrishnan, Anand
Other Authors: Le Berre, Henry; Wilfong, Benjamin; Spratt, Jean-Sebastien; Rodriguez, Mauro, Jr.; Colonius, Tim; Bryngelson, Spencer H.
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2814624527
003 UK-CbPIL
022 |a 2331-8422 
024 7 |a 10.1016/j.cpc.2024.109238  |2 doi 
035 |a 2814624527 
045 0 |b d20240203 
100 1 |a Radhakrishnan, Anand 
245 1 |a Method for portable, scalable, and performant GPU-accelerated simulation of multiphase compressible flow 
260 |b Cornell University Library, arXiv.org  |c Feb 3, 2024 
513 |a Working Paper 
520 3 |a Multiphase compressible flows are often characterized by a broad range of space and time scales, entailing large grids and small time steps. Simulations of these flows on CPU-based clusters can thus take several wall-clock days. Offloading the compute kernels to GPUs appears attractive but is memory-bound for standard finite-volume and -difference methods, damping speed-ups. Even when realized, faster GPU-based kernels lead to more intrusive communication and I/O times. We present a portable strategy for GPU acceleration of multiphase compressible flow solvers that addresses these challenges and obtains large speedups at scale. We use OpenACC for portable offloading of all compute kernels while maintaining low-level control when needed. An established Fortran preprocessor and metaprogramming tool, Fypp, enables otherwise hidden compile-time optimizations. This strategy exposes compile-time optimizations and high memory reuse while retaining readable, maintainable, and compact code. Remote direct memory access, realized via CUDA-aware MPI, reduces communication times. We implement this approach in the open-source solver MFC. Metaprogramming-based preprocessing results in an 8-times speedup of the most expensive kernels, 46% of peak FLOPs on NVIDIA GPUs, and high arithmetic intensity (about 10 FLOPs/byte). In representative simulations, a single A100 GPU is 300-times faster than an Intel Xeon CPU core, corresponding to a 9-times speedup for a single A100 compared to the entire CPU die. At the same time, near-ideal (97%) weak scaling is observed for at least 13824 GPUs on Summit. A strong scaling efficiency of 84% is retained for an 8-times increase in GPU count. Collective I/O, implemented via MPI3, helps ensure negligible contribution of data transfers. Large many-GPU simulations of compressible (solid-)liquid-gas flows demonstrate the practical utility of this strategy. 
653 |a Multiphase 
653 |a Compressible flow 
653 |a Simulation 
653 |a Central processing units--CPUs 
653 |a Solvers 
653 |a Acceleration 
653 |a Kernels 
653 |a Portability 
653 |a Gas flow 
653 |a Damping 
700 1 |a Le Berre, Henry 
700 1 |a Wilfong, Benjamin 
700 1 |a Spratt, Jean-Sebastien 
700 1 |a Rodriguez, Mauro, Jr 
700 1 |a Colonius, Tim 
700 1 |a Bryngelson, Spencer H 
773 0 |t arXiv.org  |g (Feb 3, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2814624527/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2305.09163
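
As an illustration of the offloading strategy summarized in the abstract (field 520), the sketch below combines an OpenACC-annotated compute loop with a halo exchange through CUDA-aware MPI, where device buffers are handed directly to MPI so communication can use remote direct memory access. It is a minimal, hypothetical example in C, not MFC source code: MFC itself is a Fortran solver and relies on Fypp metaprogramming for compile-time specialization, which is not shown here. The array size, stencil, and periodic neighbor layout are assumptions made only to keep the program self-contained.

/* Minimal sketch (not MFC source code): an OpenACC-offloaded
 * finite-difference loop with a CUDA-aware MPI halo exchange.
 * Build with an OpenACC compiler and a CUDA-aware MPI, e.g.
 *   mpicc -acc -o halo halo.c   (NVIDIA HPC SDK; assumed toolchain) */
#include <mpi.h>
#include <stdlib.h>

#define N  1024   /* interior cells per rank (illustrative)  */
#define NG 1      /* ghost cells on each side (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;   /* periodic neighbors (assumed) */
    int right = (rank + 1) % size;

    double *q  = malloc((N + 2 * NG) * sizeof *q);  /* conserved variable  */
    double *dq = malloc((N + 2 * NG) * sizeof *dq); /* computed derivative */
    for (int i = 0; i < N + 2 * NG; ++i) q[i] = (double)(rank * N + i);

    /* Keep both arrays resident on the GPU across communication and compute. */
    #pragma acc data copy(q[0:N+2*NG]) create(dq[0:N+2*NG])
    {
        /* host_data exposes the device address of q, so a CUDA-aware MPI
         * can move GPU buffers directly (RDMA) without host staging. */
        #pragma acc host_data use_device(q)
        {
            MPI_Sendrecv(&q[N],      NG, MPI_DOUBLE, right, 0,
                         &q[0],      NG, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&q[NG],     NG, MPI_DOUBLE, left,  1,
                         &q[N + NG], NG, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Compute kernel offloaded with OpenACC: a central difference. */
        #pragma acc parallel loop present(q[0:N+2*NG], dq[0:N+2*NG])
        for (int i = NG; i < N + NG; ++i)
            dq[i] = 0.5 * (q[i + 1] - q[i - 1]);
    }

    free(q);
    free(dq);
    MPI_Finalize();
    return 0;
}

A production solver would exchange packed multidimensional buffers for many flow variables and overlap communication with interior computation; this sketch only shows the directive pattern the abstract refers to.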