Method for portable, scalable, and performant GPU-accelerated simulation of multiphase compressible flow

Saved in:
Bibliographic Details
Published in: arXiv.org (Feb 3, 2024), p. n/a
Main Author: Radhakrishnan, Anand
Other Authors: Le Berre, Henry; Wilfong, Benjamin; Spratt, Jean-Sebastien; Rodriguez, Mauro, Jr.; Colonius, Tim; Bryngelson, Spencer H.
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2814624527
003 UK-CbPIL
022 |a 2331-8422 
024 7 |a 10.1016/j.cpc.2024.109238  |2 doi 
035 |a 2814624527 
045 0 |b d20240203 
100 1 |a Radhakrishnan, Anand 
245 1 |a Method for portable, scalable, and performant GPU-accelerated simulation of multiphase compressible flow 
260 |b Cornell University Library, arXiv.org  |c Feb 3, 2024 
513 |a Working Paper 
520 3 |a Multiphase compressible flows are often characterized by a broad range of space and time scales, entailing large grids and small time steps. Simulations of these flows on CPU-based clusters can thus take several wall-clock days. Offloading the compute kernels to GPUs appears attractive but is memory-bound for standard finite-volume and -difference methods, damping speed-ups. Even when realized, faster GPU-based kernels lead to more intrusive communication and I/O times. We present a portable strategy for GPU acceleration of multiphase compressible flow solvers that addresses these challenges and obtains large speedups at scale. We use OpenACC for portable offloading of all compute kernels while maintaining low-level control when needed. An established Fortran preprocessor and metaprogramming tool, Fypp, enables otherwise hidden compile-time optimizations. This strategy exposes compile-time optimizations and high memory reuse while retaining readable, maintainable, and compact code. Remote direct memory access, realized via CUDA-aware MPI, reduces communication times. We implement this approach in the open-source solver MFC. Metaprogramming-based preprocessing results in an 8-times speedup of the most expensive kernels, 46% of peak FLOPs on NVIDIA GPUs, and high arithmetic intensity (about 10 FLOPs/byte). In representative simulations, a single A100 GPU is 300-times faster than an Intel Xeon CPU core, corresponding to a 9-times speedup for a single A100 compared to the entire CPU die. At the same time, near-ideal (97%) weak scaling is observed for at least 13824 GPUs on Summit. A strong scaling efficiency of 84% is retained for an 8-times increase in GPU count. Collective I/O, implemented via MPI3, helps ensure negligible contribution of data transfers. Large many-GPU simulations of compressible (solid-)liquid-gas flows demonstrate the practical utility of this strategy. 
653 |a Multiphase 
653 |a Compressible flow 
653 |a Simulation 
653 |a Central processing units--CPUs 
653 |a Solvers 
653 |a Acceleration 
653 |a Kernels 
653 |a Portability 
653 |a Gas flow 
653 |a Damping 
700 1 |a Le Berre, Henry 
700 1 |a Wilfong, Benjamin 
700 1 |a Spratt, Jean-Sebastien 
700 1 |a Rodriguez, Mauro, Jr 
700 1 |a Colonius, Tim 
700 1 |a Bryngelson, Spencer H 
773 0 |t arXiv.org  |g (Feb 3, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2814624527/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2305.09163
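
As an illustration of the offloading strategy summarized in the abstract (field 520), the sketch below combines an OpenACC-annotated compute loop with a halo exchange through CUDA-aware MPI, where device buffers are handed directly to MPI so communication can use remote direct memory access. It is a minimal, hypothetical example in C, not MFC source code: MFC itself is a Fortran solver and relies on Fypp metaprogramming for compile-time specialization, which is not shown here. The array size, stencil, and periodic neighbor layout are assumptions made only to keep the program self-contained.

/* Minimal sketch (not MFC source code): an OpenACC-offloaded
 * finite-difference loop with a CUDA-aware MPI halo exchange.
 * Build with an OpenACC compiler and a CUDA-aware MPI, e.g.
 *   mpicc -acc -o halo halo.c   (NVIDIA HPC SDK; assumed toolchain) */
#include <mpi.h>
#include <stdlib.h>

#define N  1024   /* interior cells per rank (illustrative)  */
#define NG 1      /* ghost cells on each side (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;   /* periodic neighbors (assumed) */
    int right = (rank + 1) % size;

    double *q  = malloc((N + 2 * NG) * sizeof *q);  /* conserved variable  */
    double *dq = malloc((N + 2 * NG) * sizeof *dq); /* computed derivative */
    for (int i = 0; i < N + 2 * NG; ++i) q[i] = (double)(rank * N + i);

    /* Keep both arrays resident on the GPU across communication and compute. */
    #pragma acc data copy(q[0:N+2*NG]) create(dq[0:N+2*NG])
    {
        /* host_data exposes the device address of q, so a CUDA-aware MPI
         * can move GPU buffers directly (RDMA) without host staging. */
        #pragma acc host_data use_device(q)
        {
            MPI_Sendrecv(&q[N],      NG, MPI_DOUBLE, right, 0,
                         &q[0],      NG, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&q[NG],     NG, MPI_DOUBLE, left,  1,
                         &q[N + NG], NG, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Compute kernel offloaded with OpenACC: a central difference. */
        #pragma acc parallel loop present(q[0:N+2*NG], dq[0:N+2*NG])
        for (int i = NG; i < N + NG; ++i)
            dq[i] = 0.5 * (q[i + 1] - q[i - 1]);
    }

    free(q);
    free(dq);
    MPI_Finalize();
    return 0;
}

A production solver would exchange packed multidimensional buffers for many flow variables and overlap communication with interior computation; this sketch only shows the directive pattern the abstract refers to.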