PyKokkos: A Performance Portability Framework for Python

Bibliographic details
Published in:	ProQuest Dissertations and Theses (2024)
Main author:	Al Awar, Nader
Published:	ProQuest Dissertations & Theses
Subjects:	Computer engineering; Computer science; Information technology
Online access:	Citation/Abstract; Full Text - PDF

Description
Abstract:	High-performance computing (HPC) hardware is becoming increasingly heterogeneous, with most modern supercomputers containing different types of processors, such as central processing units (CPUs) and graphics processing units (GPUs), from a variety of hardware vendors, such as NVIDIA, AMD, and Intel. To enable programmers to write software that extracts the maximum possible performance from their processors, hardware vendors typically provide programming frameworks that specifically target their own hardware.

Developing software with these frameworks results in code that is tightly coupled to the targeted processor, since the frameworks have different application programming interfaces (APIs) and usage guidelines. Using these frameworks, programmers write parallel, high-performance functions, which are known as kernels. The APIs allow programmers to interface with the processors, while the usage guidelines provide directions on how to write kernel code that extracts the highest possible performance. Differences in these APIs and usage guidelines mean that porting code from one type of processor to another requires considerable effort from programmers: they must rewrite their code to use the new framework's API and learn its usage guidelines and best practices in order to achieve good performance on the new processor. Finally, they have to maintain two versions of the same code, one for each processor. As new processors and programming frameworks are constantly emerging, programmers must keep updating their code to take advantage of the new hardware and software, which is not a scalable approach to software development.

An alternative approach is to use programming frameworks that enable writing code that runs on different types of processors with good performance, a concept known as performance portability. One such framework is Kokkos, a performance-portable programming model with a C++ implementation, which aims to provide a single API that runs efficiently on different hardware. While Kokkos achieves its goals of performance portability, its availability as a C++-only library negatively impacts usability. C++ is a powerful and widely used programming language but is notorious for being difficult to use. This is especially true for scientists with no formal training in software development, a group that forms a large portion of Kokkos's user base. Instead, these users prefer higher-level languages such as Python, a high-level, dynamically-typed, and interpreted language that has historically prioritized usability over performance.

This dissertation presents PyKokkos, a Python framework for writing parallel, performance-portable kernels, as well as PyFuser, a kernel fusion framework which provides further speedups.

Unlike C++ Kokkos, PyKokkos enables performance portability in Python by providing software abstractions that allow programmers to write their kernels entirely in Python. Internally, PyKokkos translates the Python kernel code to C++ Kokkos code and automatically generates language bindings to allow for interoperability between Python and the generated C++ code. Using PyKokkos, we ported a number of existing C++ Kokkos examples to Python and showed that the PyKokkos kernels match the original kernels in terms of performance while being easier to write. These examples include ExaMiniMD, a molecular dynamics miniapplication of ∼3k lines of code. Furthermore, PyKokkos achieves better performance than Numba, the state-of-the-art Python library for writing kernels.

The dissertation then introduces PyFuser, a kernel fusion framework for PyKokkos. PyFuser first uses lazy evaluation to delay the execution of PyKokkos kernels and stores them in a trace. When the output of a kernel is accessed later, PyFuser automatically extracts the sequence of kernels that need to be executed to produce that output. PyFuser does not require any modifications to the PyKokkos code it operates on and is able to achieve speedups of 3.8× on average over the original unfused kernels.
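
The lazy-evaluation idea described in the abstract can be illustrated with a minimal pure-Python sketch. This is not PyFuser's implementation or API: the `LazyTrace` class, its `launch` and `read` methods, and the toy kernels `scale` and `add_one` are all hypothetical names introduced here for illustration, and the sketch assumes each buffer is written by at most one pending kernel. It shows only the deferral of kernels into a trace and the extraction of the kernels an accessed output depends on; the fusion step itself is not modeled.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One deferred kernel launch: (kernel, input buffer names, output buffer name).
Op = Tuple[Callable[..., list], Tuple[str, ...], str]


class LazyTrace:
    """Toy trace of deferred kernels (hypothetical; not the PyFuser API)."""

    def __init__(self, data: Dict[str, list]):
        self.data = dict(data)          # buffers that already hold values
        self.ops: List[Op] = []         # pending kernels, in launch order
        self.executed: List[str] = []   # names of kernels that actually ran

    def launch(self, kernel: Callable[..., list],
               inputs: Sequence[str], output: str) -> None:
        # Lazy evaluation: record the kernel instead of running it.
        self.ops.append((kernel, tuple(inputs), output))

    def read(self, name: str) -> list:
        # Walk the trace backwards, collecting every pending kernel
        # that the requested buffer depends on.
        needed = {name}
        selected: List[Op] = []
        remaining: List[Op] = []
        for op in reversed(self.ops):
            _, inputs, output = op
            if output in needed:
                selected.append(op)
                needed.update(inputs)
            else:
                remaining.append(op)
        self.ops = remaining[::-1]      # unrelated kernels stay deferred

        # Run the extracted kernels in their original launch order.
        for kernel, inputs, output in reversed(selected):
            self.data[output] = kernel(*(self.data[i] for i in inputs))
            self.executed.append(kernel.__name__)
        return self.data[name]


def scale(x: list) -> list:
    return [2 * v for v in x]


def add_one(y: list) -> list:
    return [v + 1 for v in y]


trace = LazyTrace({"x": [1, 2, 3]})
trace.launch(scale, ["x"], "y")
trace.launch(add_one, ["y"], "z")
trace.launch(scale, ["x"], "unused")   # never read, so never executed here
z = trace.read("z")
```

In this sketch, reading `z` runs only `scale` and `add_one`; the launch that writes `unused` remains in the trace. A fusion framework could, in principle, merge the extracted sequence into a single kernel before running it, which is the part this example leaves out.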
ISBN:	9798310395800
Source:	ProQuest Dissertations & Theses Global