Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter
| Published in: | Algorithms vol. 18, no. 3 (2025), p. 150 |
|---|---|
| Main Author: | Rozinek, Ondřej |
| Other Authors: | Marek, Jaroslav; Panuš, Jan; Mareš, Jan |
| Published: | MDPI AG |
| Subjects: | |
| Online Access: | Citation/Abstract, Full Text + Graphics, Full Text - PDF |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3181337594 | ||
| 003 | UK-CbPIL | ||
| 022 | |a 1999-4893 | ||
| 024 | 7 | |a 10.3390/a18030150 |2 doi | |
| 035 | |a 3181337594 | ||
| 045 | 2 | |b d20250101 |b d20251231 | |
| 084 | |a 231333 |2 nlm | ||
| 100 | 1 | |a Rozinek, Ondřej |u Rozinet s.r.o., U Josefa 110, 532 10 Pardubice, Czech Republic; <email>ondrej.rozinek@rozinet.net</email>; Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; <email>jan.panus@upce.cz</email>; Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, 166 34 Prague, Czech Republic | |
| 245 | 1 | |a Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter | |
| 260 | |b MDPI AG |c 2025 | ||
| 513 | |a Journal Article | ||
| 520 | 3 | |a In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS utilizes a newly developed similarity space with favorable properties combined with a metric space, employing a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering over 99% of dissimilar records, with an approximately constant time complexity of <inline-formula>≈O(1)</inline-formula>. In the second stage, FRMS runs in polynomial time of approximately <inline-formula>O(n^4)</inline-formula> and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search with real-time runtime. | |
| 610 | 4 | |a Soundex | |
| 653 | |a Lower bounds | ||
| 653 | |a Commercialization | ||
| 653 | |a Software | ||
| 653 | |a Similarity | ||
| 653 | |a Search engines | ||
| 653 | |a Datasets | ||
| 653 | |a Deep learning | ||
| 653 | |a Approximate string matching | ||
| 653 | |a Ontology | ||
| 653 | |a Assignment problem | ||
| 653 | |a Polynomials | ||
| 653 | |a Metric space | ||
| 653 | |a Cluster analysis | ||
| 653 | |a Natural language | ||
| 653 | |a Time | ||
| 653 | |a Document storage | ||
| 653 | |a Graph theory | ||
| 653 | |a Neural networks | ||
| 653 | |a Optimization | ||
| 653 | |a Perception | ||
| 653 | |a Linear programming | ||
| 653 | |a Methods | ||
| 653 | |a Phonetics | ||
| 653 | |a Algorithms | ||
| 653 | |a Complexity | ||
| 653 | |a Real time | ||
| 653 | |a Semantics | ||
| 653 | |a Databases | ||
| 653 | |a Property | ||
| 653 | |a Storage | ||
| 653 | |a Matching | ||
| 653 | |a Data mining | ||
| 700 | 1 | |a Marek, Jaroslav |u Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; <email>jan.mares@upce.cz</email> | |
| 700 | 1 | |a Panuš, Jan |u Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; <email>jan.panus@upce.cz</email> | |
| 700 | 1 | |a Mareš, Jan |u Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; <email>jan.mares@upce.cz</email>; Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, 166 28 Prague, Czech Republic | |
| 773 | 0 | |t Algorithms |g vol. 18, no. 3 (2025), p. 150 | |
| 786 | 0 | |d ProQuest |t Engineering Database | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3181337594/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text + Graphics |u https://www.proquest.com/docview/3181337594/fulltextwithgraphics/embedded/L8HZQI7Z43R0LA5T?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text - PDF |u https://www.proquest.com/docview/3181337594/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch |
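
The two-stage idea summarized in the abstract (field 520) can be illustrated with a minimal Python sketch. It is not the authors' FRMS or their optimal filter: the first function applies the classic unpadded q-gram count filter, and the second scores records by a brute-force best one-to-one token assignment using a Dice coefficient over q-grams. All function names, the choice of q = 2, and the Dice token similarity are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s: str, t: str, k: int, q: int = 2) -> bool:
    """Classic q-gram count filter (assumed stand-in for stage one).

    Strings within edit distance k share at least
    max(|s|, |t|) - q + 1 - k*q q-grams, so pairs below that
    bound can be discarded without computing the full similarity.
    """
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= max(len(s), len(t)) - q + 1 - k * q


def token_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over q-gram multisets (illustrative choice).

    Tokens shorter than q produce no q-grams and score 0.0 here.
    """
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    return 2 * sum((ga & gb).values()) / total if total else 0.0


def record_similarity(r1: str, r2: str) -> float:
    """Best one-to-one token assignment, found by brute force.

    This enumerates permutations, so it is only usable for short
    records; a polynomial assignment solver would replace it in practice.
    """
    short, long_ = sorted((r1.lower().split(), r2.lower().split()), key=len)
    if not short:
        return 0.0
    best = max(
        sum(token_similarity(a, b) for a, b in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    # Normalize by the longer record so unmatched tokens lower the score.
    return best / len(long_)


if __name__ == "__main__":
    print(passes_count_filter("pardubice", "pardubise", k=1))
    print(passes_count_filter("pardubice", "prague", k=1))
    print(record_similarity("Jan Panus", "Panus Jan"))
```

In this sketch the filter acts as a cheap precheck and the assignment step runs only on pairs that pass it. The token order of a record does not affect the score, which is the bag-of-words behavior the abstract describes.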