Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter

Bibliographic Details
Published in: Algorithms vol. 18, no. 3 (2025), p. 150
Main Author: Rozinek, Ondřej
Other Authors: Marek, Jaroslav; Panuš, Jan; Mareš, Jan
Published: MDPI AG
Description
Abstract: In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS utilizes a newly developed similarity space with favorable properties combined with a metric space, employing a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F-measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering over 99% of dissimilar records, with an approximately constant time complexity of O(1). In the second stage, FRMS runs in polynomial time of approximately O(n^4) and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search that runs in real time.
ISSN:1999-4893
DOI:10.3390/a18030150
Source: Engineering Database
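
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' FRMS or their optimal filter, whose definitions the abstract does not give. Stage one uses the textbook count-filter bound for padded q-grams (strings within edit distance k share at least max(|s|, |t|) - 1 - (k - 1)q q-grams). Stage two scores two records by a maximum-weight matching of their tokens, found here by exhaustive search, which is practical only for short records. All function names, the `#` padding character, and the use of `difflib.SequenceMatcher` as the token similarity are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import permutations


def padded_qgrams(s: str, q: int = 2, pad: str = "#") -> Counter:
    """Multiset of q-grams of s after padding both ends with q-1 pad chars."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_count_filter(s: str, t: str, k: int, q: int = 2) -> bool:
    """Stage 1 (textbook bound, not the paper's optimal filter).

    Returns False only when s and t cannot be within edit distance k,
    so cheap rejections never discard a true match.
    """
    shared = sum((padded_qgrams(s, q) & padded_qgrams(t, q)).values())
    return shared >= max(len(s), len(t)) - 1 - (k - 1) * q


def token_similarity(a: str, b: str) -> float:
    """Stand-in token similarity in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: str, r2: str) -> float:
    """Stage 2: maximum-weight bipartite matching between token bags.

    Brute force over assignments; a real system would use a polynomial
    assignment algorithm instead.
    """
    short, long_ = sorted((r1.lower().split(), r2.lower().split()), key=len)
    if not long_:
        return 1.0  # two empty records are treated as identical
    best = max(
        sum(token_similarity(a, b) for a, b in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    # Normalise by the longer record so unmatched tokens lower the score.
    return best / len(long_)
```

In a search setting, `passes_count_filter` would be applied to every candidate first, and `record_similarity` only to the few candidates that survive, which is the division of labor the abstract describes.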