Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 8, 2024), p. n/a
Main Author: Li, Yangning
Other Authors: Li, Yinghui, Wang, Xinyu, Jiang, Yong, Zhang, Zhen, Zheng, Xinran, Wang, Hui, Zheng, Hai-Tao, Xie, Pengjun, Yu, Philip S, Huang, Fei, Zhou, Jingren
Published:
Cornell University Library, arXiv.org
Subjects: Heuristic, Datasets, Questions, Large language models, Query processing, Retrieval
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3125867191
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3125867191 
045 0 |b d20241208 
100 1 |a Li, Yangning 
245 1 |a Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent 
260 |b Cornell University Library, arXiv.org  |c Dec 8, 2024 
513 |a Working Paper 
520 3 |a Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAGs typically rely on predefined, fixed retrieval processes, which causes two issues: (1) non-adaptive retrieval queries and (2) overloaded retrieval queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the required knowledge can be readily obtained with a standard two-step retrieval. To bridge this dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions that require complex knowledge retrieval strategies variable in query, tool, and time: (1) questions with rapidly changing answers, (2) questions requiring multimodal knowledge, and (3) multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose OmniSearch, the first self-adaptive planning agent for multimodal retrieval. The underlying idea is to emulate human question-solving behavior, dynamically decomposing complex multimodal questions into sub-question chains with retrieval actions. Extensive experiments prove the effectiveness of OmniSearch and also provide direction for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch. 
653 |a Heuristic 
653 |a Datasets 
653 |a Questions 
653 |a Large language models 
653 |a Query processing 
653 |a Retrieval 
700 1 |a Li, Yinghui 
700 1 |a Wang, Xinyu 
700 1 |a Jiang, Yong 
700 1 |a Zhang, Zhen 
700 1 |a Zheng, Xinran 
700 1 |a Wang, Hui 
700 1 |a Zheng, Hai-Tao 
700 1 |a Xie, Pengjun 
700 1 |a Yu, Philip S 
700 1 |a Huang, Fei 
700 1 |a Zhou, Jingren 
773 0 |t arXiv.org  |g (Dec 8, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3125867191/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.02937
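
Note on the method described in the 520 abstract: below is a minimal, hypothetical Python sketch of the kind of self-adaptive planning loop the abstract describes, in which a planner decomposes a multimodal question into a chain of sub-questions, each paired with a retrieval action, and feeds the retrieved evidence back into planning. All names, signatures, and the toy planning policy are illustrative assumptions and do not reflect the actual OmniSearch codebase.

# Hypothetical sketch of the self-adaptive planning loop summarized in the
# 520 abstract: decompose a multimodal question into a sub-question chain,
# pairing each sub-question with a retrieval action. Names, signatures, and
# the toy planner policy are assumptions, not the OmniSearch implementation.
from dataclasses import dataclass

@dataclass
class Step:
    sub_question: str   # decomposed sub-question for this planning step
    tool: str           # retrieval tool chosen for it (e.g. "web_search")
    evidence: str       # knowledge returned by the retrieval action

def plan_next_step(question: str, image: str, history: list) -> dict:
    # In the real agent this would be an MLLM call that inspects the
    # question, the image, and the evidence gathered so far, then emits
    # either the next (sub_question, tool) pair or a final answer.
    if len(history) < 2:  # toy policy: two retrieval steps, then answer
        return {"sub_question": f"follow-up {len(history) + 1} on: {question}",
                "tool": "web_search", "final_answer": None}
    return {"sub_question": None, "tool": None,
            "final_answer": f"answer synthesized from {len(history)} retrievals"}

def retrieve(sub_question: str, tool: str) -> str:
    # Stand-in for a real retrieval action (text search, image search, ...).
    return f"[{tool}] evidence for: {sub_question}"

def answer(question: str, image: str, max_steps: int = 5) -> str:
    # Iteratively plan, retrieve, and re-plan until the planner decides it
    # holds enough knowledge to answer, or the step budget runs out.
    history: list[Step] = []
    for _ in range(max_steps):
        decision = plan_next_step(question, image, history)
        if decision["final_answer"] is not None:
            return decision["final_answer"]
        evidence = retrieve(decision["sub_question"], decision["tool"])
        history.append(Step(decision["sub_question"], decision["tool"], evidence))
    return "no answer within step budget"

if __name__ == "__main__":
    print(answer("Who coached the team that won the most recent final?",
                 image="match_photo.jpg"))

Run as a script, this toy loop performs two mock retrievals before the planner returns a synthesized answer; the "dynamic" questions in Dyn-VQA are precisely those for which the number and type of such steps cannot be fixed in advance.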