Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention
| Published in: | PeerJ Computer Science (Nov 7, 2025) |
|---|---|
| Main author: | Hashem, Ahlam |
| Other authors: | Arif, Muhammad; Alghamdi, Manal; Al Ghamdi, Mohammed A; Almotiri, Sultan H |
| Publisher: | PeerJ, Inc. |
| Online access: | Citation/Abstract; Full Text + Graphics |
MARC
| LEADER | 00000nab a2200000uu 4500 | | |
|---|---|---|---|
| 001 | | | 3269734304 |
| 003 | | | UK-CbPIL |
| 022 | | | $a 2376-5992 |
| 024 | 7 | | $a 10.7717/peerj-cs.3254 $2 doi |
| 035 | | | $a 3269734304 |
| 045 | 0 | | $b d20251107 |
| 100 | 1 | | $a Hashem, Ahlam |
| 245 | 1 | | $a Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention |
| 260 | | | $b PeerJ, Inc. $c Nov 7, 2025 |
| 513 | | | $a Journal Article |
| 520 | 3 | | $a Speech emotion recognition (SER) plays a pivotal role in enabling machines to infer subjective human emotions from audio information alone. This capability is essential for effective communication and for enhancing the user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets with various fusion methods; the parallel CNNs combined with a transformer encoder and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model was also evaluated on the ShEMO dataset, achieving a UW of 68%, which confirms its robustness and effectiveness across diverse datasets. These comprehensive evaluations highlight the generalizability of the proposed approach. |
| 653 | | | $a Communication |
| 653 | | | $a Artificial neural networks |
| 653 | | | $a Recognition |
| 653 | | | $a Neural networks |
| 653 | | | $a User experience |
| 653 | | | $a Emotions |
| 653 | | | $a Coders |
| 653 | | | $a Datasets |
| 653 | | | $a Human-computer interaction |
| 653 | | | $a Human-computer interface |
| 653 | | | $a Computer mediated communication |
| 653 | | | $a Emotion recognition |
| 653 | | | $a Attention |
| 653 | | | $a Classification |
| 653 | | | $a Effectiveness |
| 653 | | | $a Phonetics |
| 653 | | | $a Algorithms |
| 653 | | | $a Audio data |
| 653 | | | $a Acoustics |
| 653 | | | $a Ablation |
| 653 | | | $a Interpersonal communication |
| 653 | | | $a Speech |
| 653 | | | $a Speech recognition |
| 653 | | | $a Subjectivity |
| 653 | | | $a Human technology relationship |
| 653 | | | $a Robustness |
| 653 | | | $a Models |
| 653 | | | $a Acknowledgment |
| 653 | | | $a Generalizability |
| 653 | | | $a Machinery |
| 653 | | | $a Accuracy |
| 700 | 1 | | $a Arif, Muhammad |
| 700 | 1 | | $a Alghamdi, Manal |
| 700 | 1 | | $a Al Ghamdi, Mohammed A |
| 700 | 1 | | $a Almotiri, Sultan H |
| 773 | 0 | | $t PeerJ Computer Science $g (Nov 7, 2025) |
| 786 | 0 | | $d ProQuest $t Advanced Technologies & Aerospace Database |
| 856 | 4 | 1 | $3 Citation/Abstract $u https://www.proquest.com/docview/3269734304/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch |
| 856 | 4 | 0 | $3 Full Text + Graphics $u https://www.proquest.com/docview/3269734304/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch |
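
The abstract (field 520) describes the paper's architecture: parallel CNN branches extracting spatial features from audio, a transformer encoder modeling temporal structure, and a co-attention mechanism fusing the branches. Below is a minimal, hypothetical PyTorch sketch of that kind of design, not the authors' code: the kernel sizes, layer counts, embedding width, the seven-class output, and the reading of co-attention as symmetric cross-attention between branches are all assumptions, and positional encoding and the paper's hierarchical fusion variants are omitted.

```python
# Illustrative sketch only (assumed sizes, not the published model):
# parallel CNN branches over a log-mel spectrogram, a shared transformer
# encoder for temporal modeling, and cross-attention ("co-attention") fusion.
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    """One convolutional branch; the kernel size differs per branch."""
    def __init__(self, kernel_size: int, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )

    def forward(self, x):                     # x: (B, 1, n_mels, T)
        h = self.net(x)                        # (B, C, n_mels/2, T/2)
        return h.mean(dim=2).transpose(1, 2)   # (B, T/2, C): pool freq axis

class CoAttentionSER(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_classes=7):  # 7 is assumed
        super().__init__()
        self.branch_a = CNNBranch(kernel_size=3, channels=d_model)
        self.branch_b = CNNBranch(kernel_size=7, channels=d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # shared
        # Co-attention: each branch attends over the other branch's features.
        self.attn_ab = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, spec):                   # spec: (B, 1, n_mels, T)
        a = self.encoder(self.branch_a(spec))  # (B, T', d)
        b = self.encoder(self.branch_b(spec))  # (B, T', d)
        a2, _ = self.attn_ab(a, b, b)          # branch A queries branch B
        b2, _ = self.attn_ba(b, a, a)          # branch B queries branch A
        fused = torch.cat([a2.mean(1), b2.mean(1)], dim=-1)  # (B, 2d)
        return self.classifier(fused)

model = CoAttentionSER()
logits = model(torch.randn(2, 1, 64, 128))     # dummy log-mel batch
print(logits.shape)                            # torch.Size([2, 7])
```

Running the module on a dummy batch confirms the shapes line up; the actual feature extraction, dataset splits, and the hierarchical fusion compared in the paper would follow the published description.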