Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter
| Published in: | Algorithms vol. 18, no. 3 (2025), p. 150 |
|---|---|
| Main Author: | |
| Other Authors: | |
| Published: | MDPI AG |
| Subjects: | |
| Online Access: | Citation/Abstract; Full Text + Graphics; Full Text - PDF |
| Abstract: | In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS uses a newly developed similarity space with favorable properties combined with a metric space, and employs a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in average time complexity and F-measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves high accuracy, filtering out over 99% of dissimilar records with approximately constant time complexity, O(1). In the second stage, FRMS runs in polynomial time, approximately O(n^4), and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search with real-time runtime. |
|---|---|
| ISSN: | 1999-4893 |
| DOI: | 10.3390/a18030150 |
| Source: | Engineering Database |
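
The first stage described in the abstract relies on a Q-gram count filter. The sketch below shows the classical count filter from the approximate string matching literature, not the optimal filter the paper constructs: if two strings are within edit distance `k`, they must share at least `max(len(a), len(b)) - q + 1 - k*q` q-grams, so any pair below that threshold can be discarded before a more expensive similarity is computed. The function names, the default `q = 2`, and the `#` padding symbol are assumptions made for illustration.

```python
from collections import Counter


def qgrams(s: str, q: int = 2, pad: bool = False) -> Counter:
    """Return the bag (multiset) of q-grams of s."""
    if pad:
        # Hypothetical padding symbol, assumed not to occur in the data.
        s = "#" * (q - 1) + s + "#" * (q - 1)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a: str, b: str, q: int = 2, pad: bool = False) -> int:
    """Size of the multiset intersection of the two q-gram bags."""
    return sum((qgrams(a, q, pad) & qgrams(b, q, pad)).values())


def passes_count_filter(a: str, b: str, k: int, q: int = 2) -> bool:
    """Classical (unpadded) count filter.

    One edit operation can destroy at most q q-grams, so strings within
    edit distance k share at least max(len) - q + 1 - k*q q-grams.
    Returning False means the pair can be rejected safely.
    """
    threshold = max(len(a), len(b)) - q + 1 - k * q
    return shared_qgrams(a, b, q) >= threshold


if __name__ == "__main__":
    print(passes_count_filter("jonathan", "jonathon", k=1))  # candidate pair
    print(passes_count_filter("jonathan", "margaret", k=1))  # filtered out
```

Only pairs that pass the filter would go on to the second stage. The `pad` option illustrates the padding extension the abstract mentions: padding gives the first and last characters as many q-grams as interior characters, but the threshold above applies to the unpadded case only.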