Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models

Bibliographic Details
Published in: arXiv.org (Dec 13, 2024), p. n/a
Main Author: Sridhar, Arvind Krishna
Other Authors: Guo, Yinyi; Visser, Erik
Published:
Cornell University Library, arXiv.org
Subjects:
Language; Questions; Data augmentation; Audio data; Large language models; Temporal logic; Reasoning
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3103019342
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3103019342 
045 0 |b d20241213 
100 1 |a Sridhar, Arvind Krishna 
245 1 |a Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models 
260 |b Cornell University Library, arXiv.org  |c Dec 13, 2024 
513 |a Working Paper 
520 3 |a The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. Recently, AQA has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models (LLMs) through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we perform further fine-tuning of an existing baseline using a curriculum learning strategy to specialize in temporal reasoning without compromising performance on fine-tuned tasks. We demonstrate the performance of our model using state-of-the-art LALMs on public audio benchmark datasets. Third, we implement our AQA model on-device locally and investigate its CPU inference for edge applications. 
653 |a Language 
653 |a Questions 
653 |a Data augmentation 
653 |a Audio data 
653 |a Large language models 
653 |a Temporal logic 
653 |a Reasoning 
700 1 |a Guo, Yinyi 
700 1 |a Visser, Erik 
773 0 |t arXiv.org  |g (Dec 13, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3103019342/abstract/embedded/ITVB7CEANHELVZIZ?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2409.06223