Unified Multimodal Interleaved Document Representation for Retrieval

Bibliographic Details
Publication date: arXiv.org (Dec 16, 2024), p. n/a
First author: Lee, Jaewoo
Other authors: Ko, Joonho; Baek, Jinheon; Jeong, Soyeong; Hwang, Sung Ju
Publisher:
Cornell University Library, arXiv.org
Subjects: Natural language processing; Information retrieval; Documents; Embedding; Representations; Query languages
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3112965088
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3112965088 
045 0 |b d20241216 
100 1 |a Lee, Jaewoo 
245 1 |a Unified Multimodal Interleaved Document Representation for Retrieval 
260 |b Cornell University Library, arXiv.org  |c Dec 16, 2024 
513 |a Working Paper 
520 3 |a Information Retrieval (IR) methods aim to identify documents relevant to a query, which have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. Also, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse IR scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information within documents. 
653 |a Natural language processing 
653 |a Information retrieval 
653 |a Documents 
653 |a Embedding 
653 |a Representations 
653 |a Query languages 
700 1 |a Ko, Joonho 
700 1 |a Baek, Jinheon 
700 1 |a Jeong, Soyeong 
700 1 |a Hwang, Sung Ju 
773 0 |t arXiv.org  |g (Dec 16, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3112965088/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2410.02729