Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions

Bibliographic Details
Published in: Symmetry vol. 17, no. 9 (2025), p. 1478-1496
Main Author: Zhang Lusheng
Other Authors: Wu, Shie; Wang Zhongxun
Published:
MDPI AG
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3254649183
003 UK-CbPIL
022 |a 2073-8994 
024 7 |a 10.3390/sym17091478  |2 doi 
035 |a 3254649183 
045 2 |b d20250101  |b d20251231 
084 |a 231635  |2 nlm 
100 1 |a Zhang Lusheng  |u School of Physics and Electronic Information, Yantai University, Yantai 264005, China; ytdxeduzls@s.ytu.edu.cn (L.Z.); wushie@ytu.edu.cn (S.W.) 
245 1 |a Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora. To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder, which integrates a Conformer backbone, Connectionist Temporal Classification–Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space, the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9% and 14.4%, respectively. These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions. 
653 |a Error analysis 
653 |a Accuracy 
653 |a Phonology 
653 |a Conditional random fields 
653 |a Cantonese 
653 |a Phonetics 
653 |a Phonemes 
653 |a Masking 
653 |a Phonemics 
653 |a Voice recognition 
653 |a Supervision 
653 |a Tone 
653 |a Speech recognition 
653 |a Multilingualism 
653 |a Robustness (mathematics) 
653 |a Annotations 
653 |a Acoustics 
653 |a Automatic speech recognition 
653 |a Speech 
653 |a Grapheme phoneme correspondence 
653 |a Chinese languages 
653 |a Reduction (Phonological or Phonetic) 
653 |a Semantics 
653 |a Cultural heritage 
653 |a Experiments 
653 |a Dropping out 
653 |a Classification 
653 |a Scarcity 
653 |a Contours 
653 |a Augmentation 
653 |a Morality 
653 |a Curricula 
700 1 |a Wu, Shie  |u School of Physics and Electronic Information, Yantai University, Yantai 264005, China; ytdxeduzls@s.ytu.edu.cn (L.Z.); wushie@ytu.edu.cn (S.W.) 
700 1 |a Wang Zhongxun  |u School of Physics and Electronic Information, Yantai University, Yantai 264005, China; ytdxeduzls@s.ytu.edu.cn (L.Z.); wushie@ytu.edu.cn (S.W.) 
773 0 |t Symmetry  |g vol. 17, no. 9 (2025), p. 1478-1496 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3254649183/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3254649183/fulltextwithgraphics/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3254649183/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
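
Illustrative sketch (not part of the catalog record). The abstract in field 520 describes two boundary-aligned augmentations without giving implementation details. The Python below is a minimal sketch of the general idea only, not the authors' method. Everything in it is an assumption made for illustration: the function names, the linear curriculum, the default max_rate of 0.2, the representation of features as a list of per-frame lists, and the assumption that forced-alignment segments are non-empty (start, end) frame ranges covering the whole utterance. The attention-based prioritization and the frequency masking mentioned in the abstract are omitted.

```python
import random
from typing import List, Tuple

Frame = List[float]
Segment = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def dropout_rate(step: int, total_steps: int, max_rate: float = 0.2) -> float:
    """Hypothetical linear curriculum: ramp the drop rate from 0 to max_rate."""
    if total_steps <= 0:
        return max_rate
    return max_rate * min(1.0, max(0.0, step / total_steps))


def phoneme_dropout(frames: List[Frame], segments: List[Segment],
                    rate: float, rng: random.Random) -> List[Frame]:
    """Drop whole phoneme segments, each independently with probability `rate`."""
    kept: List[Frame] = []
    for start, end in segments:
        if rng.random() >= rate:  # keep this phoneme in full
            kept.extend(frames[start:end])
    # Assumed safeguard: never return an empty utterance.
    return kept if kept else list(frames)


def phoneme_time_mask(frames: List[Frame], segments: List[Segment],
                      max_width: int, rng: random.Random) -> List[Frame]:
    """Zero a run of frames that lies entirely inside one phoneme segment."""
    masked = [list(f) for f in frames]  # copy so the input is left untouched
    start, end = rng.choice(segments)
    width = rng.randint(1, min(max_width, end - start))
    offset = rng.randint(start, end - width)
    for t in range(offset, offset + width):
        masked[t] = [0.0] * len(masked[t])
    return masked
```

A possible use, again under the assumptions above: compute rate = dropout_rate(step, total_steps) once per training step, apply phoneme_dropout to the feature frames, and realign the segment boundaries before applying phoneme_time_mask, since dropping segments shifts frame indices.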