Deduplication Methods Using Levenshtein Distance Algorithm

Bibliographic Details
Published in: Journal of Electrical Systems vol. 20, no. 7s (2024), p. 997
Main Author: Valeriano, Eugene S
Published:
Engineering and Scientific Research Groups
Online Access: Citation/Abstract
Full Text - PDF
Description
Abstract: The study aimed to propose methods to improve the data integrity of relational databases such as MS SQL, MySQL, and PostgreSQL via duplicate record detection. The FODORS and ZAGAT restaurant benchmark datasets were used to facilitate the processes involved in preparing and delivering high-quality data. Furthermore, the Levenshtein distance algorithm was used to propose three (3) methods, namely default, eliminating equal strings, and knowledge-based libraries, to reduce duplicate records in the database. At the selected threshold of 70%, the methods detected an average of 88 duplicate records out of 112 records in the restaurant dataset. Finally, how efficiently duplicate records can be detected in the database depends on the data being analyzed and the threshold selected.
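The thresholded matching described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the study converts an edit distance into a percentage, so normalizing by the length of the longer string is an assumption here. The function names and the sample restaurant names are hypothetical and are not taken from the FODORS or ZAGAT datasets. Only a plain pairwise comparison is shown; the "eliminating equal strings" and "knowledge-based libraries" variants are not reproduced.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_duplicates(records, threshold=0.70):
    """Return index pairs whose case-insensitive similarity meets the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i].lower(), records[j].lower()) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample records, invented for illustration only.
names = ["Blue Harbor Cafe", "Blue Harbour Cafe", "Golden Noodle House"]
duplicate_pairs = find_duplicates(names, threshold=0.70)
```

With these sample names, only the first two records differ by a single inserted character, so they are the only pair flagged at the 70% threshold. The pairwise loop is quadratic in the number of records, which is workable for a small benchmark but would likely need blocking or indexing on a large table.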
ISSN: 1112-5209
Source: Engineering Database