Fine-Tuning Large Language Models for Kazakh Text Simplification
Saved in:
| Published in: | Applied Sciences vol. 15, no. 15 (2025), p. 8344-8367 |
|---|---|
| Main author: | Alymzhan Toleu |
| Other authors: | Gulmira Tolegen, Irina Ualiyeva |
| Published: | MDPI AG |
| Subjects: | |
| Online access: | Citation/Abstract; Full Text + Graphics; Full Text - PDF |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3239020943 | ||
| 003 | UK-CbPIL | ||
| 022 | |a 2076-3417 | ||
| 024 | 7 | |a 10.3390/app15158344 |2 doi | |
| 035 | |a 3239020943 | ||
| 045 | 2 | |b d20250101 |b d20251231 | |
| 084 | |a 231338 |2 nlm | ||
| 100 | 1 | |a Alymzhan, Toleu |u Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; a.toleu@satbayev.university; g.tolegen@satbayev.university (G.T.) | |
| 245 | 1 | |a Fine-Tuning Large Language Models for Kazakh Text Simplification | |
| 260 | |b MDPI AG |c 2025 | ||
| 513 | |a Journal Article | ||
| 520 | 3 | |a This paper addresses the text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity—significantly above GPT-4o-mini. Error analysis shows that remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh’s agglutinative morphology and flexible syntax. | |
| 653 | |a Error analysis | ||
| 653 | |a Language | ||
| 653 | |a Readability | ||
| 653 | |a Machine learning | ||
| 653 | |a Simplified language | ||
| 653 | |a Word sense disambiguation | ||
| 653 | |a Datasets | ||
| 653 | |a Fluency | ||
| 653 | |a Parallel corpora | ||
| 653 | |a Syntax | ||
| 653 | |a Semantic change | ||
| 653 | |a Benchmarks | ||
| 653 | |a Kazakh language | ||
| 653 | |a Linear programming | ||
| 653 | |a Methods | ||
| 653 | |a Natural language processing | ||
| 653 | |a Multilingualism | ||
| 653 | |a Linguistics | ||
| 653 | |a Language modeling | ||
| 653 | |a Heuristic | ||
| 653 | |a Tense | ||
| 653 | |a Large language models | ||
| 653 | |a Teaching | ||
| 653 | |a English language | ||
| 653 | |a Morphology | ||
| 653 | |a Simplicity | ||
| 653 | |a Simplification | ||
| 653 | |a Preservation | ||
| 700 | 1 | |a Tolegen, Gulmira |u Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; a.toleu@satbayev.university; g.tolegen@satbayev.university (G.T.) | |
| 700 | 1 | |a Ualiyeva, Irina |u Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan; a.toleu@satbayev.university; g.tolegen@satbayev.university (G.T.) | |
| 773 | 0 | |t Applied Sciences |g vol. 15, no. 15 (2025), p. 8344-8367 | |
| 786 | 0 | |d ProQuest |t Publicly Available Content Database | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3239020943/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text + Graphics |u https://www.proquest.com/docview/3239020943/fulltextwithgraphics/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text - PDF |u https://www.proquest.com/docview/3239020943/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
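The abstract above reports BLEU 33.50, SARI 56.38, and a length ratio of 0.98 for KazSim. As a rough illustration of how two of these surface metrics are computed, here is a minimal, stdlib-only Python sketch of sentence-level BLEU (geometric mean of clipped n-gram precisions with a brevity penalty) and the candidate/reference length ratio. This is not the paper's evaluation code, which would presumably use a standard toolkit; SARI, which additionally scores kept, added, and deleted n-grams against the original source sentence, is omitted for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Illustrative sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)

def length_ratio(candidate, reference):
    """Candidate length over reference length, in whitespace tokens."""
    return len(candidate.split()) / max(len(reference.split()), 1)
```

For identical candidate and reference sentences this BLEU returns 1.0; a length ratio near 0.98, as reported for KazSim, means the simplified outputs are almost as long as the references.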