Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention
Saved record:
| Published in: | PeerJ Computer Science (Nov 7, 2025) |
|---|---|
| Main author: | |
| Other authors: | |
| Publisher: | PeerJ, Inc. |
| Subjects: | |
| Abstract: | Speech emotion recognition (SER) plays a pivotal role in enabling machines to determine human subjective emotions based only on audio information. This capability is essential for effective communication and for enhancing the user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a Transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods. The parallel CNNs combined with a Transformer and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%. On version 2 of the ASVP-ESD dataset, the model achieves a WA of 52% and a UW of 45%. Furthermore, the model was evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse datasets, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach. |
|---|---|
| ISSN: | 2376-5992 |
| DOI: | 10.7717/peerj-cs.3254 |
| Source: | Advanced Technologies & Aerospace Database |
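The architecture described in the abstract (two parallel CNN branches for spectral features, a Transformer encoder for temporal features, and co-attention to fuse them) might be sketched roughly as below. All layer sizes, kernel choices, and the single-query co-attention form are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of parallel CNNs + Transformer encoder + co-attention for SER.
# Assumes a log-mel spectrogram input of shape (batch, 1, n_mels, time);
# every dimension and layer choice here is a placeholder, not the paper's setup.
import torch
import torch.nn as nn

class ParallelCNNTransformerSER(nn.Module):
    def __init__(self, n_mels=64, d_model=128, n_classes=7):
        super().__init__()
        # Two parallel CNN branches with different kernel sizes to capture
        # spectral (spatial) patterns at multiple scales.
        self.branch_a = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(16 * 8 * 8, d_model),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(16 * 8 * 8, d_model),
        )
        # Transformer encoder over the time axis for temporal features.
        self.proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Co-attention: the fused CNN representation queries the temporal sequence.
        self.co_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spec):                               # spec: (B, 1, n_mels, T)
        cnn = self.branch_a(spec) + self.branch_b(spec)    # (B, d_model)
        seq = self.proj(spec.squeeze(1).transpose(1, 2))   # (B, T, d_model)
        tem = self.transformer(seq)                        # (B, T, d_model)
        fused, _ = self.co_attn(cnn.unsqueeze(1), tem, tem)
        return self.classifier(fused.squeeze(1))           # (B, n_classes)

model = ParallelCNNTransformerSER()
logits = model(torch.randn(2, 1, 64, 100))
```

The single-query co-attention shown here is only one simple way to let spatial and temporal streams interact; the paper compares several fusion methods, with hierarchical co-attention performing best.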