SynopsisDB: A Distributed Data System Supports In-System Data Exploration

Bibliographic details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Zhang, Xin
Published:
ProQuest Dissertations & Theses
Subjects: Computer science; Computer engineering; Applied mathematics; Management
Online link: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3271743352
003 UK-CbPIL
020 |a 9798263308971 
035 |a 3271743352 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Zhang, Xin 
245 1 |a SynopsisDB: A Distributed Data System Supports In-System Data Exploration 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a In the era of big data, domain experts commonly begin their analysis by exploring diverse datasets to gain meaningful insights. The concept of the Data Lake has emerged in recent years as a modern solution for storing and managing data from heterogeneous sources. It has quickly become the mainstream storage paradigm in industry, with widely adopted platforms such as Amazon Lake Formation, Azure Data Lake, and Google BigLake. In this thesis, we present a distributed data processing system named SynopsisDB, designed to support large-scale data exploration over data lakes. SynopsisDB consists of three layers: the storage layer, the query processing layer, and the user interface layer. The storage layer manages thousands of data files, combining storage engines of data lakes with a local Log-Structured Merge (LSM) tree-based engine. The data lake files are stored in the Hadoop Distributed File System (HDFS), while the local engine runs on a NewSQL database system that extends the leveled LSM-tree architecture, named Bi-LSM. The query processing layer features a component called SynopsisLake, which extends the Data Lakehouse architecture to manage and query thousands of data synopses. SynopsisLake bridges the gap between traditional query optimization techniques from Database Management Systems (DBMSs) and Data Warehouses and the heterogeneous, multi-resolution nature of data synopses in modern data lakes. The user interface layer supports three key operations: approximate query processing, progressive query processing, and progressive query visualization. These capabilities let domain experts explore their data efficiently, gain early insights, and interactively refine their queries within a short time. Together, these contributions make SynopsisDB a comprehensive and practical system for scalable, synopsis-driven data exploration in the age of big data. 
653 |a Computer science 
653 |a Computer engineering 
653 |a Applied mathematics 
653 |a Management 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3271743352/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3271743352/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch