Low-Resourced Alphabet-Level Pivot-Based Neural Machine Translation for Translating Korean Dialects

Bibliographic Details
Published in:	Applied Sciences vol. 15, no. 17 (2025), p. 9459-9476
Main Author:	Park, Junho
Other Authors:	Park, Seong-Bae
Published:	MDPI AG
Subjects:	Language; Dialects; Experiments; Parallel corpora; Machine translation; Sequences; Standard dialects; Interpreters; Chinese languages; Japanese language; Sentences; Foreign languages; Phonetics; Translation; Speech; Large language models; Korean language; Morphology; Alphabets; Normalization; Translation methods and strategies

Description
Abstract:	Developing a machine translator from a Korean dialect to a foreign language is challenging because parallel corpora for direct dialect translation are lacking. To address this, the paper proposes a pivot-based machine translation model that consists of two sub-translators. The first sub-translator is a sequence-to-sequence model with a minGRU encoder and a GRU decoder; it uses alphabet-level tokenization to normalize a dialect sentence into a standard sentence. The second sub-translator is a legacy translator, such as an off-the-shelf neural machine translator or an LLM, which translates the normalized standard sentence into the foreign language. Empirical analysis demonstrates the effectiveness of both alphabet-level tokenization and the minGRU encoder for the normalization model. Alphabet-level tokenization is shown to be more effective for Korean dialect normalization than other widely used sub-word tokenizations. The minGRU encoder performs comparably to a GRU encoder while being faster and handling longer token sequences more effectively. The pivot-based translation method is also validated through a broad range of experiments, which demonstrate its effectiveness in translating Korean dialects into English, Chinese, and Japanese.
ISSN:	2076-3417
DOI:	10.3390/app15179459
Source:	Publicly Available Content Database
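
Illustrative sketch (not taken from the paper): the abstract describes a two-stage pivot pipeline in which a dialect sentence is tokenized at the alphabet level, normalized to standard Korean, and then passed to an existing translator. The Python below shows one plausible way to wire those stages together. It assumes that "alphabet-level" means Hangul jamo and obtains them through Unicode NFD decomposition; the paper's actual tokenizer may differ. The names to_alphabet_tokens, from_alphabet_tokens, and pivot_translate are hypothetical, and the two callables stand in for the minGRU/GRU normalizer and the legacy translator, neither of which is implemented here.

```python
import unicodedata
from typing import Callable, List


def to_alphabet_tokens(sentence: str) -> List[str]:
    """Split text into alphabet-level tokens.

    Assumption: NFD decomposition is used so that each precomposed Hangul
    syllable becomes its conjoining jamo (initial, medial, optional final).
    """
    return list(unicodedata.normalize("NFD", sentence))


def from_alphabet_tokens(tokens: List[str]) -> str:
    """Recompose jamo tokens into precomposed Hangul syllables (NFC)."""
    return unicodedata.normalize("NFC", "".join(tokens))


def pivot_translate(
    dialect_sentence: str,
    normalize: Callable[[List[str]], List[str]],
    translate: Callable[[str], str],
) -> str:
    """Dialect -> standard Korean (the pivot) -> foreign language.

    `normalize` is a placeholder for the dialect normalization model and
    `translate` is a placeholder for an off-the-shelf translator or LLM.
    """
    tokens = to_alphabet_tokens(dialect_sentence)
    standard_sentence = from_alphabet_tokens(normalize(tokens))
    return translate(standard_sentence)


if __name__ == "__main__":
    # Stand-in components: an identity normalizer and a tagging "translator".
    result = pivot_translate(
        "한국어",
        normalize=lambda toks: toks,
        translate=lambda s: "<en>" + s,
    )
    print(to_alphabet_tokens("한"))
    print(result)
```

Operating on jamo keeps the vocabulary very small, at the cost of longer token sequences, which is consistent with the abstract's remark that the encoder must manage longer sequences efficiently.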