Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention
Saved in:
| Published in: | PeerJ Computer Science (Nov 7, 2025) |
|---|---|
| Main author: | |
| Other authors: | , , , |
| Publisher: | PeerJ, Inc. |
| Subjects: | |
| Online access: | Citation/Abstract Full Text + Graphics |
| Abstract: | Speech emotion recognition (SER) plays a pivotal role in enabling machines to infer human subjective emotions from audio information alone. This capability is essential for effective communication and an improved user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods; parallel CNNs combined with a transformer and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model was also evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse datasets, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach. |
|---|---|
| ISSN: | 2376-5992 |
| DOI: | 10.7717/peerj-cs.3254 |
| Izvor: | Advanced Technologies & Aerospace Database |
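The abstract names a co-attention fusion between two feature streams (parallel-CNN spatial features and transformer-encoder temporal features) but does not specify its exact form. Below is a minimal NumPy sketch of one generic co-attention variant, assuming dot-product affinities and mean-pooled fusion; all function names, shapes, and the fusion-by-concatenation step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(spatial, temporal):
    """Generic co-attention sketch (hypothetical simplification of the
    paper's mechanism): each stream attends over the other through a
    shared affinity matrix, then pooled summaries are concatenated."""
    affinity = spatial @ temporal.T                       # (S, T) pairwise affinities
    spatial_ctx = softmax(affinity, axis=1) @ temporal    # (S, d) temporal context per spatial step
    temporal_ctx = softmax(affinity.T, axis=1) @ spatial  # (T, d) spatial context per temporal step
    # fuse by concatenating mean-pooled contexts into one utterance vector
    return np.concatenate([spatial_ctx.mean(axis=0), temporal_ctx.mean(axis=0)])

rng = np.random.default_rng(0)
cnn_feats = rng.standard_normal((12, 64))  # stand-in for parallel-CNN output
trf_feats = rng.standard_normal((20, 64))  # stand-in for transformer-encoder output
fused = co_attention(cnn_feats, trf_feats)
print(fused.shape)  # (128,)
```

The fused vector would then feed a classifier head over the emotion classes; the paper additionally reports a hierarchical variant of this fusion, which this sketch does not attempt to reproduce.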