Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models


Bibliographic Details
Published in:	arXiv.org (Dec 13, 2024), p. n/a
Main Author:	Sridhar, Arvind Krishna
Other Authors:	Guo, Yinyi; Visser, Erik
Published:	Cornell University Library, arXiv.org
Subjects:	Language; Questions; Data augmentation; Audio data; Large language models; Temporal logic; Reasoning
Online Access:	Citation/Abstract; Full text outside of ProQuest

MARC


LEADER	00000nab a2200000uu 4500
001	3103019342
003	UK-CbPIL
022			\|a 2331-8422
035			\|a 3103019342
045	0		\|b d20241213
100	1		\|a Sridhar, Arvind Krishna
245	1		\|a Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
260			\|b Cornell University Library, arXiv.org \|c Dec 13, 2024
513			\|a Working Paper
520	3		\|a The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. Recently, AQA has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models (LLMs) through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we perform a further fine-tuning of an existing baseline using curriculum learning strategy to specialize in temporal reasoning without compromising performance on fine-tuned tasks. We demonstrate the performance of our model using state-of-the-art LALMs on public audio benchmark datasets. Third, we implement our AQA model on-device locally and investigate its CPU inference for edge applications.
653			\|a Language
653			\|a Questions
653			\|a Data augmentation
653			\|a Audio data
653			\|a Large language models
653			\|a Temporal logic
653			\|a Reasoning
700	1		\|a Guo, Yinyi
700	1		\|a Visser, Erik
773	0		\|t arXiv.org \|g (Dec 13, 2024), p. n/a
786	0		\|d ProQuest \|t Engineering Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3103019342/abstract/embedded/BH75TPHOCCPB476R?source=fedsrch
856	4	0	\|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2409.06223