Workload-Aware Optimization for High-Throughput Log Analytics


Bibliographic Details
Published in:	ProQuest Dissertations and Theses (2025)
Main author:	Zhang, Ling
Published:	ProQuest Dissertations & Theses
Subjects:	Computer science; Systems science; Information science

Description
Abstract:	In modern system management, it is critical to collect and analyze large volumes of log data. Regular expressions (regex) are the industry norm for extracting information from these logs. However, neither database management systems (DBMSs) nor log analysis systems incorporate regex evaluation into their query optimization. Brute-force regex evaluation is computationally expensive, especially as log data grows into the petabyte range while hardware performance remains the same. This creates a critical bottleneck: traditional regex engines, designed for correctness and generality, cannot sustain the required throughput, while traditional full-text indexing solutions impose prohibitive storage overhead. This dissertation presents a workload-aware approach to efficient query processing in resource-constrained settings. We demonstrate that by exploiting the specific statistical characteristics of log analysis workloads, we can design specialized query engines and lightweight indexing structures that deliver order-of-magnitude performance gains with far less space overhead. First, we address the problem of optimizing regex-based log processing under computational and strict space constraints by introducing BLARE, a framework that speeds up regex matching without requiring extra storage. BLARE treats regex evaluation as a query-planning problem. It breaks complex regex queries down into patterns and literals and uses multi-armed bandits to learn an effective splitting strategy from a small sample. This adaptive strategy speeds up regex matching by up to 168 times over standard libraries such as RE2 and PCRE2. Second, we address the computational and memory constraints associated with regex evaluation by carefully examining different n-gram indexing methods, including their performance and overheads. We show that theoretically optimal selection algorithms incur prohibitive construction costs, negating their benefits. We demonstrate that different methods suit different types of workloads, and that simple frequency-based heuristics yield a practical and robust solution that scales better as data size grows. Finally, building on these insights, we introduce REI, a lightweight bit-vector indexing framework. By indexing query n-grams rather than the data, REI improves filtering performance while reducing index construction overhead by several orders of magnitude. Together, these contributions show that exploiting workload characteristics enables high-performance log analytics when hardware provisioning is constrained.
ISBN:	9798270242589
Source:	ProQuest Dissertations & Theses Global
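
Illustrative note: the abstract describes filtering log lines with a bit-vector index built over query n-grams before running the full regex. The Python sketch below shows that general idea only. It is not the REI or BLARE implementation; the function names (`ngrams`, `build_index`, `search`), the choice of trigrams, and the sample log lines are all assumptions made for illustration. It also assumes the caller supplies a literal that every match of the regex must contain.

```python
import re

def ngrams(s, n=3):
    """Return the set of character n-grams of s (empty if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(lines, query_literals, n=3):
    """Assign one bit to each n-gram seen in the query literals (not in the data),
    then store one integer bitmask per log line."""
    vocab = sorted(set().union(*(ngrams(lit, n) for lit in query_literals)))
    bit_of = {gram: i for i, gram in enumerate(vocab)}
    masks = []
    for line in lines:
        mask = 0
        for gram, bit in bit_of.items():
            if gram in line:
                mask |= 1 << bit
        masks.append(mask)
    return bit_of, masks

def search(lines, masks, bit_of, literal, pattern, n=3):
    """Keep lines whose bitmask covers every indexed n-gram of `literal`,
    then confirm each candidate with the full regex."""
    need = 0
    for gram in ngrams(literal, n):
        bit = bit_of.get(gram)
        if bit is not None:  # n-grams that were never indexed cannot be used to filter
            need |= 1 << bit
    rx = re.compile(pattern)
    return [line for line, mask in zip(lines, masks)
            if mask & need == need and rx.search(line)]

# Hypothetical sample data.
lines = [
    "2025-01-01 ERROR disk full on /dev/sda1",
    "2025-01-01 INFO service started",
    "2025-01-02 ERROR timeout after 30s",
]
bit_of, masks = build_index(lines, ["ERROR"])
hits = search(lines, masks, bit_of, "ERROR", r"ERROR .*after \d+s")
print(hits)
```

Because the bit vocabulary comes from the query literals, the index size depends on the workload rather than on the number of distinct n-grams in the logs. The bitmask test can only produce false positives, which the final regex check removes.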