Word segmentation of ancient Tamil text extracted from inscriptions

Bibliographic Details
Published in: Heritage Science vol. 13, no. 1 (Dec 2025), p. 97
Published: Springer Nature B.V.
Description
Abstract: The absence of word boundaries in scriptio continua scripts hinders the development of NLP models for such scripts. The objective of this research is to facilitate the building of NLP models for scriptio continua scripts by designing a word segmentation model that predicts word boundaries between characters in sentences, focusing particularly on ancient Tamil scripts. We used an n-gram Naive Bayes model to predict whether a word boundary exists between two adjacent characters in a scriptio continua text. We trained and evaluated the model on a dataset of ancient Tamil writing, achieving an accuracy of 91.28%. Efficient segmentation of ancient Tamil texts not only helps preserve and understand historical manuscripts but also enables advances in automated text segmentation. The model will assist archaeologists in building NLP models for ancient Tamil, allowing significant information to be extracted from ancient Tamil manuscripts without the need for a language expert. Future research could examine word segmentation techniques with better performance, handle scripts from different centuries, and develop models for additional scripts.
ISSN: 2050-7445
DOI: 10.1038/s40494-025-01612-2
Source: Materials Science Database
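
The abstract describes the approach only at a high level: an n-gram Naive Bayes classifier decides, for each pair of adjacent characters, whether a word boundary falls between them. The sketch below illustrates that general idea and is not the authors' implementation. The feature set (the n characters on each side of the gap plus the straddling pair), the add-one smoothing, the class name `NgramNaiveBayesSegmenter`, and the toy Latin-letter training data are all assumptions made for illustration. It also treats each Unicode code point as a character, whereas real Tamil text would more likely be split into grapheme clusters first.

```python
import math
from collections import Counter


def gap_features(chars, i, n=2):
    """Features for the gap between chars[i] and chars[i + 1] (assumed feature set)."""
    left = "".join(chars[max(0, i - n + 1): i + 1])   # up to n chars before the gap
    right = "".join(chars[i + 1: i + 1 + n])          # up to n chars after the gap
    return ("L:" + left, "R:" + right, "LR:" + chars[i] + chars[i + 1])


def examples_from_sentence(sentence, n=2):
    """Turn a space-segmented sentence into (features, is_boundary) pairs, one per gap."""
    chars, ends_word = [], []
    for word in sentence.split():
        for j, ch in enumerate(word):
            chars.append(ch)
            ends_word.append(j == len(word) - 1)
    return [(gap_features(chars, i, n), ends_word[i]) for i in range(len(chars) - 1)]


class NgramNaiveBayesSegmenter:
    """Hypothetical multinomial Naive Bayes over character n-gram context features."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n
        self.alpha = alpha  # additive (Laplace) smoothing, an assumption
        self.class_counts = Counter()
        self.feature_counts = {True: Counter(), False: Counter()}
        self.vocab = set()

    def fit(self, sentences):
        for sentence in sentences:
            for feats, label in examples_from_sentence(sentence, self.n):
                self.class_counts[label] += 1
                for f in feats:
                    self.feature_counts[label][f] += 1
                    self.vocab.add(f)
        return self

    def _log_score(self, feats, label):
        total = sum(self.class_counts.values())
        score = math.log((self.class_counts[label] + self.alpha) / (total + 2 * self.alpha))
        # The extra +1 in the denominator reserves mass for unseen features.
        denom = sum(self.feature_counts[label].values()) + self.alpha * (len(self.vocab) + 1)
        for f in feats:
            score += math.log((self.feature_counts[label][f] + self.alpha) / denom)
        return score

    def segment(self, text):
        """Insert a space at every gap where 'boundary' outscores 'no boundary'."""
        chars = list(text)
        out = []
        for i, ch in enumerate(chars):
            out.append(ch)
            if i < len(chars) - 1:
                feats = gap_features(chars, i, self.n)
                if self._log_score(feats, True) > self._log_score(feats, False):
                    out.append(" ")
        return "".join(out)


# Toy training data (not from the paper's dataset).
model = NgramNaiveBayesSegmenter(n=2).fit(["ab cd", "cd ab", "ab cd ab"])
print(model.segment("abcd"))  # → ab cd
```

Framing segmentation as a binary decision per character gap keeps the model simple: each gap is classified independently, so no sequence decoding is required. The trade-off is that neighbouring decisions cannot inform each other, which is one place where more effective techniques could be explored.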