A Radical-Based Token Representation Method for Enhancing Chinese Pre-Trained Language Models

Bibliographic Details
Published in: Electronics vol. 14, no. 5 (2025), p. 1031
Main Author: Qin, Honglun
Other Authors: Li, Meiwen, Wang, Lin, Ge, Youming, Zhu, Junlong, Zheng, Ruijuan
Published:
MDPI AG
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3176380797
003 UK-CbPIL
022 |a 2079-9292 
024 7 |a 10.3390/electronics14051031  |2 doi 
035 |a 3176380797 
045 2 |b d20250101  |b d20251231 
084 |a 231458  |2 nlm 
100 1 |a Qin, Honglun  |u School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China; <email>honglunqin@stu.haust.edu.cn</email> (H.Q.); <email>linwang@haust.edu.cn</email> (L.W.); <email>geym@haust.edu.cn</email> (Y.G.); <email>jlzhu@haust.edu.cn</email> (J.Z.) 
245 1 |a A Radical-Based Token Representation Method for Enhancing Chinese Pre-Trained Language Models 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a In the domain of natural language processing (NLP), a primary challenge pertains to the process of Chinese tokenization, which remains challenging due to the lack of explicit word boundaries in written Chinese. The existing tokenization methods often treat each Chinese character as an indivisible unit, neglecting the finer semantic features embedded in the characters, such as radicals. To tackle this issue, we propose a novel token representation method that integrates radical-based features into the process. The proposed method extends the vocabulary to include both radicals and original character tokens, enabling a more granular understanding of Chinese text. We also conduct experiments on seven datasets covering multiple Chinese natural language processing tasks. The results show that our method significantly improves model performance on downstream tasks. Specifically, the accuracy of BERT on the BQ Corpus dataset was enhanced to 86.95%, showing an improvement of 1.65% over the baseline. Additionally, the BERT-wwm performance demonstrated a 1.28% enhancement, suggesting that the incorporation of fine-grained radical features offers a more efficacious solution for Chinese tokenization and paves the way for future research in Chinese text processing. 
653 |a Word processing 
653 |a Language 
653 |a Datasets 
653 |a Semantic features 
653 |a Deep learning 
653 |a Natural language processing 
653 |a Word boundaries 
653 |a Methods 
653 |a Phonetics 
653 |a Language modeling 
653 |a Chinese languages 
653 |a Morphology 
653 |a Representations 
653 |a Efficiency 
653 |a Semantics 
653 |a Experiments 
653 |a Personality 
653 |a Task performance 
653 |a Vocabulary 
700 1 |a Li, Meiwen  |u School of Software, Henan University of Science and Technology, Luoyang 471023, China; <email>zhengruijuan@haust.edu.cn</email> 
700 1 |a Wang, Lin  |u School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China; <email>honglunqin@stu.haust.edu.cn</email> (H.Q.); <email>linwang@haust.edu.cn</email> (L.W.); <email>geym@haust.edu.cn</email> (Y.G.); <email>jlzhu@haust.edu.cn</email> (J.Z.) 
700 1 |a Ge, Youming  |u School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China; <email>honglunqin@stu.haust.edu.cn</email> (H.Q.); <email>linwang@haust.edu.cn</email> (L.W.); <email>geym@haust.edu.cn</email> (Y.G.); <email>jlzhu@haust.edu.cn</email> (J.Z.) 
700 1 |a Zhu, Junlong  |u School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China; <email>honglunqin@stu.haust.edu.cn</email> (H.Q.); <email>linwang@haust.edu.cn</email> (L.W.); <email>geym@haust.edu.cn</email> (Y.G.); <email>jlzhu@haust.edu.cn</email> (J.Z.) 
700 1 |a Zheng, Ruijuan  |u School of Software, Henan University of Science and Technology, Luoyang 471023, China; <email>zhengruijuan@haust.edu.cn</email> 
773 0 |t Electronics  |g vol. 14, no. 5 (2025), p. 1031 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3176380797/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3176380797/fulltextwithgraphics/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3176380797/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch