Speech Emotion Recognition on MELD and RAVDESS Datasets Using CNN

Bibliographic Details
Published in: Information vol. 16, no. 7 (2025), p. 518-536
Main Author: Waleed, Gheed T
Other Authors: Shaker, Shaimaa H
Published: MDPI AG
Online Access: Citation/Abstract, Full Text + Graphics, Full Text - PDF
Description
Abstract: Speech emotion recognition (SER) plays a vital role in enhancing human–computer interaction (HCI) and can be applied in affective computing, virtual support, and healthcare. This research presents a high-performance SER framework based on a lightweight 1D Convolutional Neural Network (1D-CNN) and a multi-feature fusion technique. Rather than employing spectrograms as image-based input, frame-level features (Mel-Frequency Cepstral Coefficients, Mel-spectrograms, and Chroma vectors) are computed across each utterance, preserving temporal information while reducing computational cost. The model attained classification accuracies of 94.0% on MELD (multi-party conversations) and 91.9% on RAVDESS (acted speech). Ablation experiments demonstrate that fusing complementary features significantly outperforms any single-feature baseline. Data augmentation techniques, including Gaussian noise and time shifting, enhance model generalisation. The proposed method demonstrates significant potential for real-time, audio-only emotion recognition on embedded or resource-constrained devices.
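A minimal sketch of the frame-level multi-feature fusion the abstract describes, using librosa. The sample rate, coefficient counts (40 MFCCs, 64 Mel bands, 12 chroma bins), and default hop length are illustrative assumptions; the record does not specify the paper's exact extraction settings.

```python
# Frame-level feature fusion: MFCCs, a Mel-spectrogram, and Chroma vectors
# are computed per frame and stacked along the feature axis, keeping the
# time axis intact so a 1D-CNN can convolve over frames rather than over
# a 2D spectrogram image.
import numpy as np
import librosa

def extract_fused_features(path, sr=16000, n_mfcc=40, n_mels=64, n_chroma=12):
    y, sr = librosa.load(path, sr=sr)                        # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    )                                                        # (n_mels, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=n_chroma)  # (n_chroma, T)
    # All three use the same default STFT framing, so T matches across them.
    fused = np.concatenate([mfcc, mel, chroma], axis=0)      # (116, T)
    return fused.T                                           # (T, 116): frames x features
```

Stacking along the feature axis yields a (frames × 116) sequence, which is the natural input shape for a Conv1D layer sliding over time.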
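The augmentation and classifier below are likewise a hedged sketch in Keras: the noise level, shift range, layer widths, and the 7-class output (MELD's emotion set) are assumptions, since the abstract names the techniques but not their parameters or the paper's exact architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def add_gaussian_noise(y, noise_std=0.005):
    # Additive Gaussian noise with a small, fixed standard deviation.
    return y + np.random.normal(0.0, noise_std, size=y.shape)

def time_shift(y, max_shift=1600):
    # Circularly shift the waveform by up to max_shift samples (0.1 s at 16 kHz).
    return np.roll(y, np.random.randint(-max_shift, max_shift))

def build_1d_cnn(n_features=116, n_classes=7):
    # Lightweight 1D-CNN over the (frames, features) sequence; global average
    # pooling handles variable-length inputs before the classifier head.
    return models.Sequential([
        layers.Input(shape=(None, n_features)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_1d_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Applying the two augmentations to the raw waveform before feature extraction, as sketched here, is one common way to realise the generalisation gains the abstract reports.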
ISSN: 2078-2489
DOI: 10.3390/info16070518
Source: Advanced Technologies & Aerospace Database