Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion
Saved in:
| Published in: | Journal of Imaging vol. 11, no. 8 (2025), p. 273-297 |
|---|---|
| Main Author: | Shakil, Md. Shahid Ahammed |
| Other Authors: | Farid, Fahmid Al; Podder, Nitun Kumar; Iqbal, S. M. Hasan Sazzad; Miah, Abu Saleh Musa; Rahim, Md Abdur; Karim, Hezerul Abdul |
| Published: | MDPI AG |
| Online Access: | Citation/Abstract; Full Text + Graphics; Full Text - PDF |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3244043498 | ||
| 003 | UK-CbPIL | ||
| 022 | |a 2313-433X | ||
| 024 | 7 | |a 10.3390/jimaging11080273 |2 doi | |
| 035 | |a 3244043498 | ||
| 045 | 2 | |b d20250101 |b d20251231 | |
| 100 | 1 | |a Shakil, Md. Shahid Ahammed |u Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh; shakil@vu.edu.bd (M.S.A.S.); nitun@pust.ac.bd (N.K.P.); sazzad@pust.ac.bd (S.M.H.S.I.) | |
| 245 | 1 | |a Bangla Speech Emotion Recognition Using Deep Learning-Based Ensemble Learning and Feature Fusion | |
| 260 | |b MDPI AG |c 2025 | ||
| 513 | |a Journal Article | ||
| 520 | 3 | |a Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model’s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Subsequently, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates a 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech. | |
| 653 | |a Accuracy | ||
| 653 | |a Deep learning | ||
| 653 | |a Datasets | ||
| 653 | |a Computer architecture | ||
| 653 | |a Artificial neural networks | ||
| 653 | |a Signal processing | ||
| 653 | |a Voting | ||
| 653 | |a Machine learning | ||
| 653 | |a Emotions | ||
| 653 | |a Representations | ||
| 653 | |a Data augmentation | ||
| 653 | |a Human-computer interface | ||
| 653 | |a Emotion recognition | ||
| 653 | |a Centroids | ||
| 653 | |a Neural networks | ||
| 653 | |a Classification | ||
| 653 | |a Effectiveness | ||
| 653 | |a Multilingualism | ||
| 653 | |a Algorithms | ||
| 653 | |a Robustness (mathematics) | ||
| 653 | |a Ensemble learning | ||
| 653 | |a Speech | ||
| 653 | |a Speech recognition | ||
| 700 | 1 | |a Farid, Fahmid Al |u Centre for Image and Vision Computing (CIVC), COE for Artificial Intelligence, Faculty of Artificial Intelligence and Engineering (FAIE), Multimedia University, Cyberjaya 63100, Selangor, Malaysia | |
| 700 | 1 | |a Podder, Nitun Kumar |u Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh; shakil@vu.edu.bd (M.S.A.S.); nitun@pust.ac.bd (N.K.P.); sazzad@pust.ac.bd (S.M.H.S.I.) | |
| 700 | 1 | |a Iqbal, S. M. Hasan Sazzad |u Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh; shakil@vu.edu.bd (M.S.A.S.); nitun@pust.ac.bd (N.K.P.); sazzad@pust.ac.bd (S.M.H.S.I.) | |
| 700 | 1 | |a Miah, Abu Saleh Musa |u Department of Computer Science and Engineering, Bangladesh Army University of Science and Technology (BAUST), Saidpur 5311, Bangladesh | |
| 700 | 1 | |a Rahim, Md Abdur |u Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna 6600, Bangladesh; shakil@vu.edu.bd (M.S.A.S.); nitun@pust.ac.bd (N.K.P.); sazzad@pust.ac.bd (S.M.H.S.I.) | |
| 700 | 1 | |a Karim, Hezerul Abdul |u Centre for Image and Vision Computing (CIVC), COE for Artificial Intelligence, Faculty of Artificial Intelligence and Engineering (FAIE), Multimedia University, Cyberjaya 63100, Selangor, Malaysia | |
| 773 | 0 | |t Journal of Imaging |g vol. 11, no. 8 (2025), p. 273-297 | |
| 786 | 0 | |d ProQuest |t Advanced Technologies & Aerospace Database | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3244043498/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text + Graphics |u https://www.proquest.com/docview/3244043498/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text - PDF |u https://www.proquest.com/docview/3244043498/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch |
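Two steps described in the abstract — a handcrafted feature such as the Zero-Crossing Rate, and the soft-voting rule that averages the per-class scores of the three model streams — can be illustrated with a minimal NumPy sketch. This is not the authors' code; function names, frame sizes, and the toy score values are illustrative assumptions only.

```python
import numpy as np

def zero_crossing_rate(signal, frame_length=2048, hop_length=512):
    """Fraction of sign changes per frame (a simplified ZCR, one of the
    handcrafted features named in the abstract)."""
    signs = np.sign(signal)
    signs[signs == 0] = 1                      # treat exact zeros as positive
    crossings = signs[:-1] != signs[1:]        # True where the sign flips
    rates = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        rates.append(crossings[start:start + frame_length - 1].mean())
    return np.array(rates)

def soft_vote(score_list):
    """Average the per-class probability scores of several models and take
    the argmax -- the soft-voting ensemble rule described in the abstract."""
    avg = np.mean(np.stack(score_list), axis=0)
    return avg.argmax(axis=-1), avg

# Toy example: three streams (1D CNN, CNN+LSTM, CNN+Bi-LSTM) scoring one
# utterance over four hypothetical emotion classes.
cnn    = np.array([[0.6, 0.2, 0.1, 0.1]])
lstm   = np.array([[0.4, 0.4, 0.1, 0.1]])
bilstm = np.array([[0.5, 0.3, 0.1, 0.1]])
label, avg = soft_vote([cnn, lstm, bilstm])
print(label)  # index of the emotion class chosen by the ensemble
```

In practice the time–frequency features (chromagram, Mel-spectrogram, MFCCs) would come from an audio library rather than raw NumPy, and each stream would output softmax probabilities; only the averaging-then-argmax step is the soft vote itself.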