Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis

Bibliographic Details
Published in: Information vol. 16, no. 4 (2025), p. 330
Main Author: Üveges István
Other Authors: Ring Orsolya
Published:
MDPI AG
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3194615642
003 UK-CbPIL
022 |a 2078-2489 
024 7 |a 10.3390/info16040330  |2 doi 
035 |a 3194615642 
045 2 |b d20250401  |b d20250430 
084 |a 231474  |2 nlm 
100 1 |a Üveges István 
245 1 |a Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a Emotion classification in natural language processing (NLP) has recently witnessed significant advancements. However, class imbalance in emotion datasets remains a critical challenge, as dominant emotion categories tend to overshadow less frequent ones, leading to biased model predictions. Traditional techniques, such as undersampling and oversampling, offer partial solutions. More recently, synthetic data generation using large language models (LLMs) has emerged as a promising strategy for augmenting minority classes and improving model robustness. In this study, we investigate the impact of synthetic data augmentation on German-language emotion classification. Using an imbalanced dataset, we systematically evaluate multiple balancing strategies, including undersampling overrepresented classes and generating synthetic data for underrepresented emotions using a GPT-4–based model in a few-shot prompting setting. Beyond enhancing model performance, we conduct a detailed linguistic analysis of the synthetic samples, examining their lexical diversity, syntactic structures, and semantic coherence to determine their contribution to overall model generalization. Our results demonstrate that integrating synthetic data significantly improves classification performance, particularly for minority emotion categories, while maintaining overall model stability. However, our linguistic evaluation reveals that synthetic examples exhibit reduced lexical diversity and simplified syntactic structures, which may introduce limitations in certain real-world applications. These findings highlight both the potential and the challenges of synthetic data augmentation in emotion classification. By providing a comprehensive evaluation of balancing techniques and the linguistic properties of generated text, this study contributes to the ongoing discourse on improving NLP models for underrepresented linguistic phenomena. 
653 |a Dictionaries 
653 |a Datasets 
653 |a Classification 
653 |a Balancing 
653 |a Structural analysis 
653 |a Performance evaluation 
653 |a Emotions 
653 |a Linguistics 
653 |a Coherence 
653 |a Semantics 
653 |a Data augmentation 
653 |a Large language models 
653 |a Sentiment analysis 
653 |a Syntax 
653 |a Syntactic structures 
653 |a Natural language processing 
653 |a German language 
653 |a Multilingualism 
653 |a Language modeling 
653 |a Morphology 
653 |a Synthetic data 
653 |a Marginality 
653 |a Property 
653 |a Robustness 
653 |a Data collection 
653 |a Augmentation 
653 |a Data 
653 |a Imbalance 
653 |a Language 
700 1 |a Ring Orsolya 
773 0 |t Information  |g vol. 16, no. 4 (2025), p. 330 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3194615642/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3194615642/fulltextwithgraphics/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3194615642/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch