A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT

Bibliographic Details
Published in: Mathematics vol. 13, no. 5 (2025), p. 864
Main Author: Jeong, Hanjo
Publisher: MDPI AG
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3176338416
003 UK-CbPIL
022 |a 2227-7390 
024 7 |a 10.3390/math13050864  |2 doi 
035 |a 3176338416 
045 2 |b d20250101  |b d20251231 
084 |a 231533  |2 nlm 
100 1 |a Jeong, Hanjo 
245 1 |a A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a Word sense disambiguation (WSD) plays a crucial role in natural language processing (NLP) tasks such as machine translation, sentiment analysis, and information retrieval. Because of the complex morphological structure and polysemy of Korean, the meaning of a word can change with context, which makes WSD challenging. Since a single word can carry multiple senses, distinguishing them accurately is essential for improving the performance of NLP models. Recently, large-scale pre-trained models such as BERT and GPT, applied through transfer learning, have shown promising results on this problem. However, for morphologically complex languages like Korean, the tokenization mismatch between a pre-trained model and its fine-tuning data prevents the rich contextual and lexical information learned during pre-training from being fully exploited in downstream tasks. This paper proposes a novel method for resolving this tokenization mismatch during fine-tuning for Korean WSD, leveraging BERT-based pre-trained models and the Sejong corpus, which was annotated by language experts. Experiments with several BERT-based pre-trained models and datasets drawn from the Sejong corpus show that the proposed method improves performance by approximately 3–5% over existing approaches. 
653 |a Language 
653 |a Word sense disambiguation 
653 |a Datasets 
653 |a Information retrieval 
653 |a Words (language) 
653 |a Context 
653 |a Corpus linguistics 
653 |a Machine translation 
653 |a Polysemy 
653 |a Machine learning 
653 |a Performance enhancement 
653 |a Sentiment analysis 
653 |a Neural networks 
653 |a Natural language processing 
653 |a Semantic analysis 
653 |a Multilingualism 
653 |a Algorithms 
653 |a Morphological complexity 
653 |a Annotations 
653 |a Korean language 
653 |a Morphology 
653 |a Semantics 
653 |a Meaning 
653 |a Data mining 
653 |a Retrieval 
653 |a Language shift 
653 |a Languages 
773 0 |t Mathematics  |g vol. 13, no. 5 (2025), p. 864 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3176338416/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3176338416/fulltextwithgraphics/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3176338416/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch
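
The abstract above centers on a concrete technical obstacle: the Sejong corpus carries sense annotations at the morpheme level, while BERT-style models consume WordPiece subwords, so labels and tokens need not line up one-to-one. The following sketch illustrates that mismatch with a naive character-offset alignment. It assumes the Hugging Face transformers library, the bert-base-multilingual-cased checkpoint, and invented sense labels; it is an illustration of the problem, not the paper's resolution method.

# Minimal sketch of the tokenization mismatch described in the abstract.
# The model name, example word, and sense labels are illustrative
# assumptions, not the paper's actual setup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Hypothetical Sejong-style annotation: (morpheme, POS tag, sense id).
# "먹었다" ("ate") = 먹/VV + 었/EP + 다/EF; only the verb stem is sense-tagged.
sejong_units = [("먹", "VV", "01"), ("었", "EP", None), ("다", "EF", None)]
word = "".join(m for m, _, _ in sejong_units)

pieces = tokenizer.tokenize(word)
print(pieces)  # e.g. ['먹', '##었', '##다'] -- the actual split depends on the vocabulary

# Build character spans for each annotated morpheme.
spans, start = [], 0
for morpheme, _, sense in sejong_units:
    spans.append((start, start + len(morpheme), sense))
    start += len(morpheme)

# Naive alignment: give each WordPiece the label of the morpheme covering
# its first character. When piece and morpheme boundaries disagree, labels
# are duplicated or dropped -- the mismatch the paper sets out to resolve.
labels, char_pos = [], 0
for piece in pieces:
    surface = piece[2:] if piece.startswith("##") else piece
    sense = next((s for b, e, s in spans if b <= char_pos < e), None)
    labels.append(sense)
    char_pos += len(surface)
print(list(zip(pieces, labels)))

Under this naive scheme, any contextual signal the tokenizer attaches to subword boundaries that straddle morphemes is lost or mislabeled, which is why a context-preserving alignment between the corpus annotation and the pre-trained vocabulary matters for fine-tuning.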