Denoising-Contrastive Alignment for Continuous Sign Language Recognition

Saved in:
Bibliographic details
Published in: arXiv.org (Dec 1, 2024), p. n/a
Main author: Guo, Leming
Other authors: Xue, Wanli; Chen, Shengyong
Published:
Cornell University Library, arXiv.org
Subjects:
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2811074187
003 UK-CbPIL
022 |a 2331-8422 
035 |a 2811074187 
045 0 |b d20241201 
100 1 |a Guo, Leming 
245 1 |a Denoising-Contrastive Alignment for Continuous Sign Language Recognition 
260 |b Cornell University Library, arXiv.org  |c Dec 1, 2024 
513 |a Working Paper 
520 3 |a Continuous sign language recognition (CSLR) aims to recognize the signs in untrimmed sign language videos as sequences of textual glosses. A key challenge in CSLR is achieving effective cross-modality alignment between video and gloss sequences to enhance the video representation. However, current cross-modality alignment paradigms often neglect the role of textual grammar in guiding the video representation to learn global temporal context, which adversely affects recognition performance. To address this limitation, we propose a Denoising-Contrastive Alignment (DCA) paradigm. DCA leverages textual grammar to enhance video representations through two complementary approaches: modeling the instance-level correspondence between signs and glosses from a discriminative perspective, and aligning their global contexts from a generative perspective. Specifically, DCA establishes flexible instance-level correspondence between signs and glosses using a contrastive loss. Building on this, DCA models global context alignment between the video and gloss sequences by denoising the gloss representation from noise, guided by the video representation. Additionally, DCA introduces gradient modulation to balance the alignment and recognition gradients, ensuring a more effective learning process. By integrating gloss-wise and global-context knowledge, DCA significantly enhances video representations for CSLR. Experimental results on public benchmarks validate the effectiveness of DCA and confirm the feasibility of its video-representation enhancement. 
653 |a Encoding-Decoding 
653 |a Semantics 
653 |a Modules 
653 |a Learning 
653 |a Diffusion 
653 |a Coders 
653 |a Noise reduction 
653 |a Representations 
653 |a Benchmarks 
653 |a Optimization 
700 1 |a Xue, Wanli 
700 1 |a Chen, Shengyong 
773 0 |t arXiv.org  |g (Dec 1, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2811074187/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2305.03614