Toward Effective Blocking for Entity Matching

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Paulsen, Derek
Published: ProQuest Dissertations & Theses
Subjects: Computer science; Computer engineering; Artificial intelligence; Information science
Online access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3282833148
003 UK-CbPIL
020 |a 9798270220532 
035 |a 3282833148 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Paulsen, Derek 
245 1 |a Toward Effective Blocking for Entity Matching 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a In this dissertation we study entity matching, a fundamental problem that lies at the heart of data integration for data science and AI. Specifically, we consider the following common entity matching problem: given two tables A and B with the same schema, find all pairs of records (a, b) ∈ A × B that "match", i.e., refer to the same real-world entity. Typically, entity matching is done in two steps: blocking and matching. The goal of the blocking step is to quickly reduce the number of pairs to be processed later in the matching step, while retaining as many true matches as possible. The goal of the matching step is to accurately predict which record pairs match. In this dissertation we focus on the blocking step and make three major contributions. The first contribution is Sparkly, a novel TF-IDF-based blocker built on top of Spark and Lucene, using a distributed shared-nothing architecture. The TF-IDF similarity measure is well known in the information retrieval literature but has received very little attention in entity matching research. In developing Sparkly, we explore TF-IDF-based blocking for entity matching and demonstrate its effectiveness in a wide range of scenarios. Extensive experiments show that Sparkly outperforms eight state-of-the-art blockers, producing both higher recall and smaller candidate sets. Additionally, we run Sparkly on over 100M tuples and demonstrate near-linear scale-out behavior. The second contribution is Delex, a system for combining blocking methods using a powerful declarative language. Delex is built on top of Spark. In real-world applications, users frequently want to combine multiple blocking methods to take advantage of the strengths of each method. Currently, combining multiple blocking methods is done in an ad hoc way, which leads to both costly development and suboptimal performance. Delex is designed from the ground up for combining blocking methods using a scalable architecture. 
Experiments show that Delex can effectively optimize blocking plans, reducing runtime by up to threefold, and can scale to large datasets. In addition, we demonstrate the extensibility of Delex by implementing a new blocking method with only 150 lines of code. The third, and final, contribution is BigGoat, a benchmark for blocking for entity matching that mirrors how blockers are created for real-world applications. There is a wide variety of entity matching benchmark datasets. However, few, if any, focus on scaling, with the majority of benchmark datasets containing fewer than 1M records. Due in part to this gap, many research blocking solutions cannot scale to large datasets, and hence are not practical for use in real-world applications. To address this problem, we created the BigGoat benchmark. BigGoat consists of five realistic datasets with tables having up to 60M records. In creating BigGoat, we also develop a novel downsampling algorithm specifically designed for estimating the recall of blockers. Collectively, the contributions presented in this dissertation represent a significant advancement in the state of the art in blocking for entity matching. In addition, this work lays the foundation for future research aimed at further improving blocking algorithms and developing more robust evaluation methodologies. 
653 |a Computer science 
653 |a Computer engineering 
653 |a Artificial intelligence 
653 |a Information science 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3282833148/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3282833148/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch