Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter

Bibliographic Details
Published in: Algorithms vol. 18, no. 3 (2025), p. 150
Main Author: Rozinek, Ondřej
Other Authors: Marek, Jaroslav; Panuš, Jan; Mareš, Jan
Published: MDPI AG
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3181337594
003 UK-CbPIL
022 |a 1999-4893 
024 7 |a 10.3390/a18030150  |2 doi 
035 |a 3181337594 
045 2 |b d20250101  |b d20251231 
084 |a 231333  |2 nlm 
100 1 |a Rozinek, Ondřej  |u Rozinet s.r.o., U Josefa 110, 532 10 Pardubice, Czech Republic; ondrej.rozinek@rozinet.net; Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; jan.panus@upce.cz; Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, 166 34 Prague, Czech Republic 
245 1 |a Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS utilizes a newly developed similarity space with favorable properties combined with a metric space, employing a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering over 99% of dissimilar records, with an approximately constant time complexity of O(1). In the second stage, FRMS runs in polynomial time of approximately O(n^4) and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search with real-time runtime. 
610 4 |a Soundex 
653 |a Lower bounds 
653 |a Commercialization 
653 |a Software 
653 |a Similarity 
653 |a Search engines 
653 |a Datasets 
653 |a Deep learning 
653 |a Approximate string matching 
653 |a Ontology 
653 |a Assignment problem 
653 |a Polynomials 
653 |a Metric space 
653 |a Cluster analysis 
653 |a Natural language 
653 |a Time 
653 |a Document storage 
653 |a Graph theory 
653 |a Neural networks 
653 |a Optimization 
653 |a Perception 
653 |a Linear programming 
653 |a Methods 
653 |a Phonetics 
653 |a Algorithms 
653 |a Complexity 
653 |a Real time 
653 |a Semantics 
653 |a Databases 
653 |a Property 
653 |a Storage 
653 |a Matching 
653 |a Data mining 
700 1 |a Marek, Jaroslav  |u Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; jan.mares@upce.cz 
700 1 |a Panuš, Jan  |u Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; jan.panus@upce.cz 
700 1 |a Mareš, Jan  |u Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; jan.mares@upce.cz; Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, 166 28 Prague, Czech Republic 
773 0 |t Algorithms  |g vol. 18, no. 3 (2025), p. 150 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3181337594/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3181337594/fulltextwithgraphics/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3181337594/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
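
Illustrative note on the abstract (field 520): the two-stage scheme it describes, a cheap Q-gram count filter followed by a token-level maximum weight bipartite matching, can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' FRMS. The filter uses the classic textbook count bound rather than the paper's optimal filter, the token similarity is a stand-in taken from Python's difflib, the matching is found by brute force, and every function name is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import permutations


def qgrams(s, q=2):
    """Multiset of overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s, t, k, q=2):
    """Classic q-gram count filter (assumption: not the paper's optimal filter).

    Two strings within edit distance k share at least
    max(len(s), len(t)) - q + 1 - k*q q-grams, so pairs below
    that bound can be discarded without a full comparison.
    """
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    threshold = max(len(s), len(t)) - q + 1 - k * q
    return shared >= threshold


def token_similarity(a, b):
    """Stand-in token similarity in [0, 1]; NOT the paper's metric."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(rec_a, rec_b):
    """Maximum weight bipartite matching between the tokens of two records.

    Brute force over permutations, so only suitable for a handful of tokens.
    The score is normalised by the token count of the longer record.
    """
    short, long_ = sorted((rec_a.split(), rec_b.split()), key=len)
    if not long_:
        return 1.0  # two empty records are treated as identical
    best = max(
        sum(token_similarity(x, y) for x, y in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best / len(long_)


def match(query, candidates, k=2, q=2):
    """Stage 1: drop candidates failing the filter. Stage 2: rank the rest."""
    survivors = [c for c in candidates if passes_count_filter(query, c, k, q)]
    return sorted(survivors, key=lambda c: record_similarity(query, c), reverse=True)
```

For example, `match("john smith", ["jon smith", "mary jones"], k=1)` discards "mary jones" in the first stage and scores only "jon smith" in the second. A production system would replace the permutation search with a polynomial-time assignment solver.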