Fine-Tuning Large Language Models for Kazakh Text Simplification

Saved in:
Detailed bibliography
Published in: Applied Sciences vol. 15, no. 15 (2025), p. 8344-8367
Main author: Alymzhan, Toleu
Other authors: Gulmira, Tolegen; Ualiyeva, Irina
Published:
MDPI AG
Topics:
Online access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3239020943
003 UK-CbPIL
022 |a 2076-3417 
024 7 |a 10.3390/app15158344  |2 doi 
035 |a 3239020943 
045 2 |b d20250101  |b d20251231 
084 |a 231338  |2 nlm 
100 1 |a Alymzhan, Toleu  |u Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; a.toleu@satbayev.university; g.tolegen@satbayev.university (G.T.) 
245 1 |a Fine-Tuning Large Language Models for Kazakh Text Simplification 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a This paper addresses the text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity, significantly above GPT-4o-mini. Error analysis shows that the remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh's agglutinative morphology and flexible syntax. 
653 |a Error analysis 
653 |a Language 
653 |a Readability 
653 |a Machine learning 
653 |a Simplified language 
653 |a Word sense disambiguation 
653 |a Datasets 
653 |a Fluency 
653 |a Parallel corpora 
653 |a Syntax 
653 |a Semantic change 
653 |a Benchmarks 
653 |a Kazakh language 
653 |a Linear programming 
653 |a Methods 
653 |a Natural language processing 
653 |a Multilingualism 
653 |a Linguistics 
653 |a Language modeling 
653 |a Heuristic 
653 |a Tense 
653 |a Large language models 
653 |a Teaching 
653 |a English language 
653 |a Morphology 
653 |a Simplicity 
653 |a Simplification 
653 |a Preservation 
700 1 |a Gulmira, Tolegen  |u Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; a.toleu@satbayev.university; g.tolegen@satbayev.university (G.T.) 
700 1 |a Ualiyeva, Irina  |u Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; a.toleu@satbayev.university; g.tolegen@satbayev.university (G.T.) 
773 0 |t Applied Sciences  |g vol. 15, no. 15 (2025), p. 8344-8367 
786 0 |d ProQuest  |t Publicly Available Content Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3239020943/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3239020943/fulltextwithgraphics/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3239020943/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch