Audio Captioning RAG via Generative Pair-to-Pair Retrieval with Refined Knowledge Base

Bibliographic Details
Published in:	arXiv.org (Dec 19, 2024), p. n/a
Main Author:	Choi Changin
Other Authors:	Lim Sungjun, Rhee Wonjong
Published:	Cornell University Library, arXiv.org
Subjects:	Impact analysis; Audio data; Knowledge bases (artificial intelligence); Queries; Information retrieval; Query processing; Ablation
Online Access:	Citation/Abstract; Full text outside of ProQuest

MARC


LEADER	00000nab a2200000uu 4500
001	3117168571
003	UK-CbPIL
022			\|a 2331-8422
035			\|a 3117168571
045	0		\|b d20241219
100	1		\|a Choi Changin
245	1		\|a Audio Captioning RAG via Generative Pair-to-Pair Retrieval with Refined Knowledge Base
260			\|b Cornell University Library, arXiv.org \|c Dec 19, 2024
513			\|a Working Paper
520	3		\|a Recent advances in audio understanding tasks leverage the reasoning capabilities of LLMs. However, adapting LLMs to learn audio concepts requires massive training data and substantial computational resources. To address these challenges, Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base (KB) and augments them with query audio to generate accurate textual responses. In RAG, the relevance of the retrieved information plays a crucial role in effectively processing the input. In this paper, we analyze how different retrieval methods and knowledge bases impact the relevance of audio-text pairs and the performance of audio captioning with RAG. We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs to the query audio, thereby improving the relevance and accuracy of retrieved information. Additionally, we refine the large-scale knowledge base to retain only audio-text pairs that align with the contextualized intents. Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD, with detailed ablation studies validating the effectiveness of our retrieval and KB construction methods.
653			\|a Impact analysis
653			\|a Audio data
653			\|a Knowledge bases (artificial intelligence)
653			\|a Queries
653			\|a Information retrieval
653			\|a Query processing
653			\|a Ablation
700	1		\|a Lim Sungjun
700	1		\|a Rhee Wonjong
773	0		\|t arXiv.org \|g (Dec 19, 2024), p. n/a
786	0		\|d ProQuest \|t Engineering Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3117168571/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch
856	4	0	\|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2410.10913
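
Illustrative sketch

The abstract in field 520 describes generative pair-to-pair retrieval only at a high level: a caption generated for the query audio serves as a text query, so that the query (audio, generated caption) pair can be matched against the audio-text pairs in the knowledge base. The Python sketch below is one possible reading of that idea, not the authors' implementation. The function names, the toy three-dimensional embeddings, the cosine scoring, and the weighting parameter `alpha` that mixes audio-to-audio and text-to-text similarity are all assumptions made for illustration; the record does not specify any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_to_pair_retrieve(query_audio_emb, generated_caption_emb, kb, k=2, alpha=0.5):
    """Rank KB audio-text pairs against the query (audio, generated caption) pair.

    Assumption: the pair score is a weighted sum of audio-to-audio and
    text-to-text cosine similarity; the source does not state the scoring rule.
    """
    scored = []
    for entry in kb:
        s_audio = cosine(query_audio_emb, entry["audio_emb"])
        s_text = cosine(generated_caption_emb, entry["text_emb"])
        scored.append((alpha * s_audio + (1 - alpha) * s_text, entry["caption"]))
    # Highest combined similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [caption for _, caption in scored[:k]]


# Hypothetical toy knowledge base; real embeddings would come from audio and text encoders.
KB = [
    {"caption": "a dog barks", "audio_emb": [1.0, 0.0, 0.0], "text_emb": [1.0, 0.0, 0.0]},
    {"caption": "rain falls on a roof", "audio_emb": [0.0, 1.0, 0.0], "text_emb": [0.0, 1.0, 0.0]},
    {"caption": "a dog growls", "audio_emb": [0.9, 0.1, 0.0], "text_emb": [0.8, 0.0, 0.2]},
]

# Embedding of the query audio, and of a caption produced for it by a first-pass captioner.
query_audio_emb = [1.0, 0.0, 0.0]
generated_caption_emb = [1.0, 0.0, 0.0]

retrieved = pair_to_pair_retrieve(query_audio_emb, generated_caption_emb, KB, k=2)
print(retrieved)
```

In a RAG pipeline of the kind the abstract outlines, the retrieved pairs would then be supplied to the captioning model together with the query audio to produce the final caption.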