Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention

Bibliographic Details
Published in: PeerJ Computer Science (Nov 7, 2025)
Main author: Hashem, Ahlam
Other authors: Arif, Muhammad; Alghamdi, Manal; Al Ghamdi, Mohammed A; Almotiri, Sultan H
Publisher: PeerJ, Inc.
Online access: Citation/Abstract; Full Text + Graphics
Description
Abstract: Speech emotion recognition (SER) plays a pivotal role in enabling machines to determine human subjective emotions from audio information alone. This capability is essential for effective communication and for enhancing the user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a Transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods. The parallel CNNs combined with a Transformer and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model was also evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse data, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach.
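The abstract describes the pipeline only at a high level. As an illustration, the PyTorch sketch below shows one plausible realization of the named components: two parallel CNN branches over a mel spectrogram, a Transformer encoder for temporal modeling, and a cross-attention (co-attention) fusion step. All layer sizes, kernel choices, the two-branch design, and the single flat co-attention stage are assumptions made for this sketch; the paper's actual configuration, including its hierarchical co-attention variant, is not specified in this record.

# Minimal sketch (assumptions throughout): parallel CNN branches,
# a Transformer encoder per branch, and co-attention fusion for SER.
import torch
import torch.nn as nn

class ParallelCNNTransformerSER(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_classes=7):
        super().__init__()
        # Two parallel CNN branches with different receptive fields,
        # each mapping (batch, 1, n_mels, time) -> (batch, d_model, 1, time')
        def branch(kernel):
            return nn.Sequential(
                nn.Conv2d(1, 32, kernel, padding="same"), nn.ReLU(),
                nn.MaxPool2d((2, 2)),
                nn.Conv2d(32, d_model, kernel, padding="same"), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)),  # collapse frequency axis
            )
        self.branch_a = branch((3, 3))  # small kernels: local spectral detail
        self.branch_b = branch((7, 7))  # large kernels: broader context
        # Transformer encoder models temporal dependencies in each branch
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Co-attention: each branch attends over the other's sequence
        self.coattn = nn.MultiheadAttention(d_model, n_heads,
                                            batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, spec):                    # spec: (batch, 1, n_mels, T)
        a = self.branch_a(spec).squeeze(2).transpose(1, 2)  # (B, T', D)
        b = self.branch_b(spec).squeeze(2).transpose(1, 2)
        a, b = self.encoder(a), self.encoder(b)
        # Cross-attend in both directions, then pool and classify
        a2b, _ = self.coattn(a, b, b)           # branch A queries branch B
        b2a, _ = self.coattn(b, a, a)           # branch B queries branch A
        pooled = torch.cat([a2b.mean(dim=1), b2a.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

model = ParallelCNNTransformerSER()
logits = model(torch.randn(2, 1, 64, 200))      # 2 dummy mel spectrograms
print(logits.shape)                             # torch.Size([2, 7])

A real implementation would add positional encodings before the Transformer encoder and stack co-attention stages hierarchically, as the abstract's best-performing variant suggests; those details are omitted here for brevity.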
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.3254
Source: Advanced Technologies & Aerospace Database