UDF-Centric Dataflow Systems for Supporting User-Defined Functions in Collaborative Data Science, AI, and ML

Bibliographic Details
Published in:	ProQuest Dissertations and Theses (2025)
Main Author:	Huang, Yicong
Published:	ProQuest Dissertations & Theses
Subjects:	Computer science; Artificial intelligence; Computer engineering
Online Access:	Citation/Abstract; Full Text - PDF

Description
Abstract:	Data science tools, spanning from data collection to analysis and visualization, and leveraging advanced techniques such as artificial intelligence (AI), machine learning (ML), and large language models (LLMs), are now indispensable across a wide range of fields. Addressing today’s complex problems demands interdisciplinary collaboration among domain experts, data engineers, computer scientists, and statisticians, as no single field holds all the necessary expertise. There is an increasing demand for systems that let teams bring their own code across languages, collaborate modularly, inspect and interact with running computations at fine granularity, and manage heterogeneous resources in a resource-aware way. For the past few years, we have been building Texera, an open-source system to support collaborative data science using GUI-based workflows. This dissertation extends Texera with first-class support for user-defined functions (UDFs) and builds UDF-centric systems to meet these needs. We first present UDFlow, a framework for supporting UDFs in dataflow systems. It provides a unified API that supports tuple-, batch-, and table-oriented execution, enabling collaborators to express UDF logic at whatever granularity their task requires. The API is also expressive enough to handle UDFs with multiple input ports and output ports. It allows collaborators to use Python, R, Scala, and Java UDFs together in a single workflow. We discuss execution support for host-language UDFs as well as foreign-language UDFs (e.g., Python, R) run in sidecar processes. We showcase the UDF UI and supporting services that provide an IDE-like experience to ease the development of UDFs. We then propose Udon, a novel UDF debugger that supports line-by-line debugging on dataflow systems. Udon allows users to set breakpoints, perform code inspections, and make code modifications while executing a UDF, even on a single tuple. It includes a novel debug-aware UDF execution model to ensure the responsiveness of the operator during debugging. It utilizes advanced state-transfer techniques to satisfy breakpoint conditions that span multiple UDFs. It incorporates various optimization techniques to reduce the runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show Udon’s high efficiency and scalability. We then present Peanut, a port-based framework for compilation and scheduling of UDFs in dataflow systems. Peanut converts a multi-port UDF into a DAG of mini-operators, called a U-plan. Each input and output port of such a UDF can be treated as a mini-operator, and the internal state is transferred via state edges between those mini-operators. Decoupling a monolithic UDF into a U-plan unlocks finer-grained parallelism and increased pipelined execution, resulting in higher resource utilization, all of which are critical capabilities for resource-intensive data-science workloads. At the core of Peanut is a UDF compiler that automatically rewrites standard, multi-port Python UDFs into U-plans. We demonstrate that Peanut can effectively optimize a wide range of real-world UDFs, from machine learning training and inference to custom join implementations. Taken together, these contributions show that making UDFs first class requires an integrated stack that spans the interface, debugger, compiler, and execution runtime. They advance the state of distributed dataflow systems toward accessible, efficient, and collaborative data science, AI, and ML.
ISBN:	9798290963273
Source:	ProQuest Dissertations & Theses Global