AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Bibliographic Details
Published in: arXiv.org (Dec 13, 2024), p. n/a
Main Author: Gao, Xiyuan
Other Authors: Bansal, Shubhi, Gowda, Kushaan, Zhu, Li, Nayak, Shekhar, Kumar, Nagendra, Coler, Matt
Published: Cornell University Library, arXiv.org
Subjects: Datasets, Data augmentation, Audio data, Modal data, Mustard, Artificial neural networks, Neural networks, Speech recognition
Available Online: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3145273898
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3145273898 
045 0 |b d20241213 
100 1 |a Gao, Xiyuan 
245 1 |a AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation 
260 |b Cornell University Library, arXiv.org  |c Dec 13, 2024 
513 |a Working Paper 
520 3 |a Detecting sarcasm effectively requires a nuanced understanding of context, including vocal tones and facial expressions. The progression towards multimodal computational methods in sarcasm detection, however, faces challenges due to the scarcity of data. To address this, we present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation). This approach utilizes the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy. The first phase involves generating varied text samples through Back Translation from several secondary languages. The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations. Alongside a cloud-based Text-to-Speech (TTS) service, this Fine-tuned FastSpeech 2 system produces corresponding audio for the text augmentations. We also investigate various attention mechanisms for effectively merging text and audio data, finding self-attention to be the most efficient for bimodal integration. Our experiments reveal that this combined augmentation and attention approach achieves a significant F1-score of 81.0% in text-audio modalities, surpassing even models that use three modalities from the MUStARD dataset. 
653 |a Datasets 
653 |a Data augmentation 
653 |a Audio data 
653 |a Modal data 
653 |a Mustard 
653 |a Artificial neural networks 
653 |a Neural networks 
653 |a Speech recognition 
700 1 |a Bansal, Shubhi 
700 1 |a Gowda, Kushaan 
700 1 |a Zhu, Li 
700 1 |a Nayak, Shekhar 
700 1 |a Kumar, Nagendra 
700 1 |a Coler, Matt 
773 0 |t arXiv.org  |g (Dec 13, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3145273898/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.10103
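
The abstract (field 520) reports that, among the attention mechanisms tested for merging text and audio, self-attention was the most effective for bimodal integration. As a rough illustration of that idea only, and not the authors' implementation, the sketch below applies one self-attention layer over the concatenated text and audio feature sequences before a small classifier head; the feature dimensions, projection layers, pooling strategy, and module names are assumptions made for the example, and the upstream text/audio encoders are out of scope.

```python
# Minimal sketch (not from the paper): self-attention fusion of text and audio
# features for binary sarcasm classification. Dimensions and the classifier
# head are illustrative assumptions.
import torch
import torch.nn as nn


class BimodalSelfAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, hidden_dim=256, num_heads=4):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Self-attention over the concatenated [text; audio] token sequence.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 2)
        )

    def forward(self, text_feats, audio_feats):
        # text_feats: (batch, text_len, text_dim); audio_feats: (batch, audio_len, audio_dim)
        tokens = torch.cat(
            [self.text_proj(text_feats), self.audio_proj(audio_feats)], dim=1
        )
        # Queries, keys, and values all come from the same fused sequence.
        attended, _ = self.self_attn(tokens, tokens, tokens)
        pooled = attended.mean(dim=1)  # mean-pool the fused sequence
        return self.classifier(pooled)  # logits: sarcastic vs. non-sarcastic


# Example with random utterance-level features in place of real encoder outputs.
model = BimodalSelfAttentionFusion()
logits = model(torch.randn(8, 20, 768), torch.randn(8, 50, 128))
print(logits.shape)  # torch.Size([8, 2])
```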