Leveraging Machine Learning for Improved Distributed System Performance

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Khalid, Hifza
Published: ProQuest Dissertations & Theses
Online Access: Citation/Abstract
Full Text - PDF
Description
Abstract: Our research aims to optimize the performance of large distributed systems, which operate across multiple machines, by applying machine learning techniques.

In our first project, we set out to use a large dataset of performance data for the Linux operating system to suggest optimal tunings for network applications in a client-server setting. We conducted a series of experiments to select the hardware and Linux configuration options that significantly affect network performance. Our results showed that network performance was mainly a function of workload and hardware. Investigating these results revealed that our dataset did not contain enough diversity in configuration settings to infer the best tuning and was useful only for making hardware recommendations. Based on these experiments and their outcomes, we concluded that one should not take the diversity of data, even in huge datasets, for granted. We also recommend a set of preliminary tests for anyone working with similarly complex datasets who plans to use machine learning.

In our second project, we considered the problem of using a dataset publicly released by Alibaba to model batch tasks, which are often overlooked in comparison to online services. The dataset contains the arrivals and resource requirements (CPU, memory, etc.) of both batch and online tasks. Our trained model predicts, with high accuracy, the number of batch tasks that arrive in any 30-minute window, their associated CPU and memory requirements, and their lifetimes. It captures over 94% of arrivals in each 30-minute window within a 95% prediction interval. The F1 scores for the most frequent CPU classes exceed 75%, and our memory and lifetime predictions incur less than 1% loss on the test data. The prediction accuracy for the lifetime of a batch task drops when the model uses both CPU and memory information, as opposed to memory information alone.

Our third project proposes a deep reinforcement learning approach to task scheduling that aims to maximize cloud resource utilization by strategically delaying batch tasks and consolidating them onto fewer machines. We explore Policy Gradient (REINFORCE) and Double Deep Q-Network (DDQN) methods to develop a self-learning scheduler that adapts to dynamic workload conditions. Experimental results show that REINFORCE increases average CPU and memory utilization by 125-200% compared to Best-Fit and Packer, markedly reduces the number of machines required, and achieves a 5-30% reduction in resource fragmentation. Although DDQN also reduces machine usage compared to traditional methods, its performance declines under high loads owing to job drops and sub-optimal long-term planning in partially observable environments. Moreover, REINFORCE is computationally more efficient and has lower memory requirements, while DDQN is more sample efficient.
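To make the second project's headline metric concrete, the following is a minimal sketch of how 95% prediction-interval coverage for per-window arrival counts can be measured. The synthetic Poisson data, the Monte Carlo interval construction, and all variable names are illustrative assumptions; none of it comes from the dissertation's actual model or the Alibaba trace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for observed batch-task arrival counts per
# 30-minute window (the real counts would come from the Alibaba trace).
observed = rng.poisson(lam=120, size=1000)

# Hypothetical model output: a 95% prediction interval per window,
# here built from Monte Carlo samples of an assumed Poisson model.
samples = rng.poisson(lam=120, size=(2000, 1000))
lower = np.quantile(samples, 0.025, axis=0)
upper = np.quantile(samples, 0.975, axis=0)

# Empirical coverage: fraction of windows whose observed count falls
# inside the interval (the thesis reports over 94% captured).
coverage = np.mean((observed >= lower) & (observed <= upper))
print(f"95% prediction-interval coverage: {coverage:.1%}")
```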
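For the third project, the sketch below shows the core REINFORCE update a self-learning scheduler of this kind could use: a policy network scores candidate machines for incoming batch tasks, and placements are reinforced in proportion to their returns. The state encoding, network shape, and returns are placeholder assumptions, not the dissertation's design.

```python
import torch
import torch.nn as nn

n_machines, state_dim = 8, 16  # assumed sizes for illustration

# Policy network: maps a (hypothetical) cluster-state encoding to a
# score per machine; softmax over scores gives placement probabilities.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, n_machines))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """One REINFORCE step: loss = -mean_t log pi(a_t | s_t) * G_t."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy rollout with random placeholders; in a real scheduler the returns
# would reward consolidation onto fewer machines and low fragmentation.
states = torch.randn(32, state_dim)
actions = torch.randint(0, n_machines, (32,))
returns = torch.randn(32)
print(reinforce_update(states, actions, returns))
```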
ISBN:9798304916066
Source: ProQuest Dissertations & Theses Global