Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention

Saved in:
Bibliographic Details
Published in: Electronics vol. 14, no. 10 (2025), p. 2040
Main Author: Gong, Liang Yu
Other Authors: Li, Xue Jun
Published:
MDPI AG
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3211940511
003 UK-CbPIL
022 |a 2079-9292 
024 7 |a 10.3390/electronics14102040  |2 doi 
035 |a 3211940511 
045 2 |b d20250101  |b d20251231 
084 |a 231458  |2 nlm 
100 1 |a Gong, Liang Yu 
245 1 |a Deepfake Voice Detection: An Approach Using End-to-End Transformer with Acoustic Feature Fusion by Cross-Attention 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a Deepfake technology uses artificial intelligence to create highly realistic but fake audio, video, or images that are often difficult to distinguish from real content. Because of its potential use for misinformation, fraud, and identity theft, deepfake technology has gained a bad reputation in the digital world. Many recent works have reported on the detection of deepfake videos and images, but few studies have concentrated on developing robust deepfake voice detection systems. In most existing studies in this field, a deepfake voice detection system requires a large amount of training data and a robust backbone to distinguish bona fide audio from logical access (LA) attack audio. For acoustic feature extraction, Mel-frequency Filter Bank (MFB)-based approaches are more suitable for extracting features from speech signals than using the raw spectrum as input. Recurrent Neural Networks (RNNs) have been successfully applied to Natural Language Processing (NLP), but these backbones suffer from vanishing or exploding gradients when processing long sequences. In addition, most deepfake voice recognition systems perform poorly under cross-dataset evaluation, which points to a robustness issue. To address these issues, we propose an acoustic feature-fusion method that combines Mel-spectrum and pitch representations through a cross-attention mechanism. We then combine a Transformer encoder with a convolutional neural network block to extract global and local features as a front end, and connect the back end to a single linear layer for classification. We summarize the performance of several deepfake voice detectors on the silence-segment-processed ASVspoof 2019 dataset: our proposed method achieves an Equal Error Rate (EER) of 26.41%, while most existing methods yield EERs higher than 30%. On the ASVspoof 2021 dataset, our method achieves an EER as low as 28.52%, while the EERs of existing methods are all higher than 28.9%. 
653 |a Forgery 
653 |a Datasets 
653 |a Deepfake 
653 |a Voice recognition 
653 |a Fourier transforms 
653 |a Artificial neural networks 
653 |a Sensors 
653 |a Frequency filters 
653 |a Deception 
653 |a Theft 
653 |a Recurrent neural networks 
653 |a Speeches 
653 |a False information 
653 |a Algorithms 
653 |a Voice activity detectors 
653 |a Filter banks 
653 |a Audio data 
653 |a Artificial intelligence 
653 |a Acoustics 
653 |a Natural language processing 
653 |a Robustness 
653 |a Speech recognition 
653 |a Fraud 
700 1 |a Li, Xue Jun 
773 0 |t Electronics  |g vol. 14, no. 10 (2025), p. 2040 
786 0 |d ProQuest  |t Advanced Technologies & Aerospace Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3211940511/abstract/embedded/75I98GEZK8WCJMPQ?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3211940511/fulltextwithgraphics/embedded/75I98GEZK8WCJMPQ?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3211940511/fulltextPDF/embedded/75I98GEZK8WCJMPQ?source=fedsrch
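Note on the abstract (field 520): it describes cross-attention fusion of Mel-spectrum and pitch representations, a Transformer-encoder-plus-CNN front end, a linear classification back end, and results reported as Equal Error Rate (EER). The following is a minimal PyTorch sketch of how such a pipeline could look, assuming frame-aligned Mel and pitch embeddings of equal dimension; all module names, layer sizes, and the EER helper are illustrative assumptions, not the authors' published code.

import numpy as np
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse pitch and Mel features: pitch queries attend over Mel frames."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mel, pitch):
        # mel, pitch: (batch, time, d_model) frame-level embeddings
        fused, _ = self.attn(query=pitch, key=mel, value=mel)
        return self.norm(fused + pitch)  # residual connection + layer norm

class DetectorSketch(nn.Module):
    """Fusion -> Transformer encoder (global) + Conv1d (local) -> linear head."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.fusion = CrossAttentionFusion(d_model, n_heads)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # global context
        self.local = nn.Sequential(                            # local patterns
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Linear(d_model, 2)  # bona fide vs. spoof logits

    def forward(self, mel, pitch):
        x = self.fusion(mel, pitch)
        x = self.encoder(x)
        x = self.local(x.transpose(1, 2)).transpose(1, 2)  # Conv1d wants (B, C, T)
        return self.head(x.mean(dim=1))  # mean-pool over time, then classify

def equal_error_rate(scores, labels):
    """EER: the operating point where false acceptance equals false rejection.
    scores: higher = more bona fide; labels: 1 = bona fide, 0 = spoof.
    Assumes both classes are present in the evaluation set."""
    order = np.argsort(scores)
    y = np.asarray(labels)[order]
    n_pos, n_neg = y.sum(), len(y) - y.sum()
    frr = np.cumsum(y) / n_pos            # bona fide rejected at/below threshold
    far = 1.0 - np.cumsum(1 - y) / n_neg  # spoof accepted above threshold
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)

A quick shape check: DetectorSketch()(torch.randn(8, 200, 128), torch.randn(8, 200, 128)) returns (8, 2) logits; equal_error_rate would then be computed from the bona fide score over an evaluation set.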