Searching for effective preprocessing method and CNN based architecture with efficient channel attention on speech emotion recognition

Bibliographic Details
Published in: Scientific Reports (Nature Publisher Group) vol. 15, no. 1 (2025), p. 32689-32709
Main Author: Kim, Byunggun
Other Authors: Kwon, Younghun
Published:
Nature Publishing Group
Subjects:
Online Access: Citation/Abstract
Full Text
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3253952784
003 UK-CbPIL
022 |a 2045-2322 
024 7 |a 10.1038/s41598-025-19887-7  |2 doi 
035 |a 3253952784 
045 2 |b d20250101  |b d20251231 
084 |a 274855  |2 nlm 
100 1 |a Kim, Byunggun  |u Department of Applied Artificial Intelligence, Hanyang University(ERICA), 425-791, Ansan, Kyunggi-Do, Republic of Korea (ROR: https://ror.org/046865y68) (GRID: grid.49606.3d) (ISNI: 0000 0001 1364 9317) 
245 1 |a Searching for effective preprocessing method and CNN based architecture with efficient channel attention on speech emotion recognition 
260 |b Nature Publishing Group  |c 2025 
513 |a Journal Article 
520 3 |a Speech emotion recognition (SER) performance has improved steadily as deep learning architectures have been adopted. In particular, convolutional neural network (CNN) models trained on spectrogram-preprocessed data are the most popular approach to SER. However, how to design an effective and efficient preprocessing method and CNN-based model for SER remains unclear, so a more concrete search over both is needed. First, to find a suitable frequency-time resolution for SER, we prepare eight datasets with different preprocessing settings. Furthermore, to compensate for the lack of emotional feature resolution, we propose a multiple short-time Fourier transform (STFT) preprocessing data augmentation that augments the training data using windows of all the different sizes. Next, because a CNN's channel filters are central to detecting hidden input features, we focus on the effectiveness of the channel filters for SER. To do so, we design several types of architecture that contain a 6-layer CNN model. In addition, using efficient channel attention (ECA), which is well known to improve channel feature representation with only a few parameters, we find that the channel filters can be trained more efficiently for SER. On two SER datasets (Interactive Emotional Dyadic Motion Capture and the Berlin Emotional Speech Database), increasing the frequency resolution when preprocessing emotional speech can improve emotion recognition performance. Consequently, a CNN-based model with only two ECA blocks can exceed the performance of previous SER models. With STFT data augmentation in particular, our proposed model achieves the highest performance on SER. 
653 |a Emotions 
653 |a Fourier transforms 
653 |a Deep learning 
653 |a Experiments 
653 |a Neural networks 
653 |a Classification 
653 |a Speech 
653 |a Filters 
653 |a Environmental 
700 1 |a Kwon, Younghun  |u Department of Applied Artificial Intelligence, Hanyang University(ERICA), 425-791, Ansan, Kyunggi-Do, Republic of Korea (ROR: https://ror.org/046865y68) (GRID: grid.49606.3d) (ISNI: 0000 0001 1364 9317); Department of Applied Physics, Hanyang University(ERICA), 425-791, Ansan, Kyunggi-Do, Republic of Korea (ROR: https://ror.org/046865y68) (GRID: grid.49606.3d) (ISNI: 0000 0001 1364 9317) 
773 0 |t Scientific Reports (Nature Publisher Group)  |g vol. 15, no. 1 (2025), p. 32689-32709 
786 0 |d ProQuest  |t Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3253952784/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text  |u https://www.proquest.com/docview/3253952784/fulltext/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3253952784/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
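
The abstract in field 520 names two techniques: STFT data augmentation across several window sizes, and efficient channel attention (ECA). The following is a minimal NumPy sketch of both ideas, not the authors' implementation. The function names, the Hann window, the log scaling, the hop length, and the fixed averaging weights in the attention step are all assumptions made for illustration; in a real ECA block the 1-D convolution weights are learned during training. The sketch also assumes the signal is longer than the largest window.

```python
import numpy as np


def stft_magnitude(signal, win_len, hop_len):
    """Log-magnitude spectrogram of a 1-D signal using a Hann window."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # rfft over each frame yields win_len // 2 + 1 frequency bins, so a
    # longer window gives finer frequency resolution.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spectrum).T  # shape: (freq_bins, n_frames)


def multi_window_augment(signal, label, win_lens, hop_len):
    """Return one (spectrogram, label) pair per window length (hypothetical helper)."""
    return [(stft_magnitude(signal, w, hop_len), label) for w in win_lens]


def eca_gate(features, kernel_size=3):
    """Channel attention in the style of ECA on a (channels, height, width) map."""
    pooled = features.mean(axis=(1, 2))       # global average pooling per channel
    padded = np.pad(pooled, kernel_size // 2)  # zero padding keeps the channel count
    # Assumed fixed weights for illustration; a trained model learns these k values.
    weights = np.full(kernel_size, 1.0 / kernel_size)
    # Reversing the kernel turns np.convolve into the cross-correlation
    # that deep learning frameworks call a 1-D convolution.
    mixed = np.convolve(padded, weights[::-1], mode="valid")
    gate = 1.0 / (1.0 + np.exp(-mixed))       # sigmoid, one value per channel
    return features * gate[:, None, None]     # rescale each channel
```

Calling `multi_window_augment(signal, label, (256, 512, 1024), 128)` would produce three training examples from one utterance, each with a different frequency resolution. The attention step adds only `kernel_size` weights regardless of the channel count, which is the property the abstract refers to as "only a few parameters".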