Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention

Bibliographic Details
Published in: PeerJ Computer Science (Nov 7, 2025)
Main Author: Hashem, Ahlam
Other Authors: Arif, Muhammad, Alghamdi, Manal, Al Ghamdi, Mohammed A, Almotiri, Sultan H
Published:
PeerJ, Inc.
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics

MARC

LEADER 00000nab a2200000uu 4500
001 3269734304
003 UK-CbPIL
022 |a 2376-5992 
024 7 |a 10.7717/peerj-cs.3254  |2 doi 
035 |a 3269734304 
045 0 |b d20251107 
100 1 |a Hashem, Ahlam 
245 1 |a Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention 
260 |b PeerJ, Inc.  |c Nov 7, 2025 
513 |a Journal Article 
520 3 |a Speech emotion recognition (SER) plays a pivotal role in enabling machines to determine human subjective emotions from audio information alone. This capability is essential for effective communication and for enhancing the user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a Transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets with various fusion methods; the parallel CNNs combined with a Transformer and hierarchical co-attention yield the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model was also evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse data, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach. 
653 |a Communication 
653 |a Artificial neural networks 
653 |a Recognition 
653 |a Neural networks 
653 |a User experience 
653 |a Emotions 
653 |a Coders 
653 |a Datasets 
653 |a Human-computer interaction 
653 |a Human-computer interface 
653 |a Computer mediated communication 
653 |a Emotion recognition 
653 |a Attention 
653 |a Classification 
653 |a Effectiveness 
653 |a Phonetics 
653 |a Algorithms 
653 |a Audio data 
653 |a Acoustics 
653 |a Ablation 
653 |a Interpersonal communication 
653 |a Speech 
653 |a Speech recognition 
653 |a Subjectivity 
653 |a Human technology relationship 
653 |a Robustness 
653 |a Models 
653 |a Acknowledgment 
653 |a Generalizability 
653 |a Machinery 
653 |a Accuracy 
700 1 |a Arif, Muhammad 
700 1 |a Alghamdi, Manal 
700 1 |a Al Ghamdi, Mohammed A 
700 1 |a Almotiri, Sultan H 
773 0 |t PeerJ Computer Science  |g (Nov 7, 2025) 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3269734304/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3269734304/fulltextwithgraphics/embedded/6A8EOT78XXH2IG52?source=fedsrch
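
Note: the abstract above describes an architecture of parallel CNN branches, a Transformer encoder, and co-attention fusion. The following is a minimal, hypothetical PyTorch sketch of that general idea; all layer sizes, kernel choices, and the fusion details are assumptions for illustration and do not reproduce the authors' implementation.

# Minimal sketch (assumptions throughout): parallel CNN branches over a log-mel
# spectrogram, a Transformer encoder over the time axis, and a co-attention fusion.
import torch
import torch.nn as nn


class ParallelCNNTransformerCoAttention(nn.Module):
    def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 7):
        super().__init__()
        # Two parallel CNN branches with different (assumed) receptive fields.
        self.branch_small = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis, keep time
        )
        self.branch_large = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, padding=3), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),
        )
        self.proj = nn.Linear(128, d_model)  # 64 + 64 channels from the two branches
        # Transformer encoder over the time axis for temporal context.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               dim_feedforward=256, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Co-attention (assumed form): CNN features attend to Transformer features
        # and vice versa via two cross-attention blocks.
        self.attn_s2t = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.attn_t2s = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        a = self.branch_small(spec).squeeze(2)   # (batch, 64, time')
        b = self.branch_large(spec).squeeze(2)   # (batch, 64, time')
        t = min(a.shape[-1], b.shape[-1])
        spatial = torch.cat([a[..., :t], b[..., :t]], dim=1).transpose(1, 2)
        spatial = self.proj(spatial)              # (batch, t, d_model)
        temporal = self.transformer(spatial)      # (batch, t, d_model)
        s_ctx, _ = self.attn_s2t(spatial, temporal, temporal)  # spatial attends to temporal
        t_ctx, _ = self.attn_t2s(temporal, spatial, spatial)   # temporal attends to spatial
        fused = torch.cat([s_ctx.mean(dim=1), t_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)             # emotion logits


if __name__ == "__main__":
    model = ParallelCNNTransformerCoAttention()
    dummy = torch.randn(2, 1, 64, 300)   # batch of 2 log-mel spectrograms
    print(model(dummy).shape)            # torch.Size([2, 7])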