Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention
Saved in:
| Published in: | PeerJ Computer Science (Nov 7, 2025) |
|---|---|
| Main author: | |
| Other authors: | , , , |
| Publisher: | PeerJ, Inc. |
| Subjects: | |
| Online access: | Citation/Abstract Full Text + Graphics |
| Abstract: | Speech emotion recognition (SER) plays a pivotal role in enabling machines to infer human subjective emotions from audio information alone. This capability is essential for effective communication and an improved user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods; parallel CNNs combined with a transformer and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model was also evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse datasets, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach. |
|---|---|
| ISSN: | 2376-5992 |
| DOI: | 10.7717/peerj-cs.3254 |
| Izvor: | Advanced Technologies & Aerospace Database |
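The abstract names a co-attention fusion between two feature streams (parallel-CNN spatial features and transformer-encoder temporal features) but does not specify its exact form. Below is a minimal NumPy sketch of one generic co-attention variant, assuming dot-product affinities and mean-pooled fusion; all function names, shapes, and the fusion-by-concatenation step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(spatial, temporal):
    """Generic co-attention sketch (hypothetical simplification of the
    paper's mechanism): each stream attends over the other through a
    shared affinity matrix, then pooled summaries are concatenated."""
    affinity = spatial @ temporal.T                       # (S, T) pairwise affinities
    spatial_ctx = softmax(affinity, axis=1) @ temporal    # (S, d) temporal context per spatial step
    temporal_ctx = softmax(affinity.T, axis=1) @ spatial  # (T, d) spatial context per temporal step
    # fuse by concatenating mean-pooled contexts into one utterance vector
    return np.concatenate([spatial_ctx.mean(axis=0), temporal_ctx.mean(axis=0)])

rng = np.random.default_rng(0)
cnn_feats = rng.standard_normal((12, 64))  # stand-in for parallel-CNN output
trf_feats = rng.standard_normal((20, 64))  # stand-in for transformer-encoder output
fused = co_attention(cnn_feats, trf_feats)
print(fused.shape)  # (128,)
```

The fused vector would then feed a classifier head over the emotion classes; the paper additionally reports a hierarchical variant of this fusion, which this sketch does not attempt to reproduce.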