PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models

Bibliographic details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main author: Liu, Yingen
Other authors: Wu, Fan; Li, Ruihui; Tang, Zhuo; Li, Kenli
Published by: Cornell University Library, arXiv.org
Subjects: Visual tasks; Memory tasks; Semantics; Large language models
Online access: Citation/Abstract; Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3115596955
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3115596955 
045 0 |b d20241202 
100 1 |a Liu, Yingen 
245 1 |a PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address this, we introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance. Unlike previous methods that rely heavily on attention mechanisms and overlook cross-modal interactions, we use a prompt-aware strategy to adaptively identify and cluster essential visual tokens. PAR categorizes visual context redundancy into two types: external and internal. External redundancy is minimized through semantic retrieval, while internal redundancy is addressed using a token routing mechanism. This method substantially reduces computational load without requiring additional training or complex architectural modifications. Experimental results demonstrate that across various visual question answering tasks, PAR reduces FLOPs by 83% with a compression ratio of 89%, while retaining 97% of baseline accuracy. The adaptive design of PAR achieves a 2x token reduction ratio compared to prior approaches, enabling a better balance between performance and efficiency. 
653 |a Visual tasks 
653 |a Memory tasks 
653 |a Semantics 
653 |a Large language models 
700 1 |a Wu, Fan 
700 1 |a Li, Ruihui 
700 1 |a Tang, Zhuo 
700 1 |a Li, Kenli 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3115596955/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2410.07278
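
The record above stops at the abstract, so the following is a minimal, hypothetical sketch of the two-stage idea the abstract describes, not the authors' released code: a semantic-retrieval pass drops visual tokens with low similarity to the prompt (external redundancy), then a greedy merge of near-duplicate survivors stands in for the token routing mechanism (internal redundancy). All function names, dimensions, and thresholds here are illustrative assumptions.

# Hypothetical sketch of prompt-aware token reduction; the embedding size,
# keep ratio, and merge threshold are invented for illustration only.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def reduce_tokens(visual: np.ndarray, prompt: np.ndarray,
                  keep_ratio: float = 0.25,
                  merge_thresh: float = 0.9) -> np.ndarray:
    """visual: (N, d) visual token embeddings; prompt: (d,) pooled prompt embedding."""
    # Stage 1 (external redundancy): keep only the tokens most
    # semantically relevant to the prompt.
    sims = cosine_sim(visual, prompt[None, :]).squeeze(-1)   # (N,)
    k = max(1, int(len(visual) * keep_ratio))
    kept = visual[np.argsort(-sims)[:k]]                     # top-k by relevance

    # Stage 2 (internal redundancy): greedily fold each surviving token
    # into an existing cluster representative if they are near-duplicates.
    merged: list[np.ndarray] = []
    for tok in kept:
        for i, rep in enumerate(merged):
            if cosine_sim(tok[None, :], rep[None, :])[0, 0] > merge_thresh:
                merged[i] = (rep + tok) / 2      # average into the cluster
                break
        else:
            merged.append(tok)                   # new cluster representative
    return np.stack(merged)

# Toy usage: 576 visual tokens (e.g. a 24x24 patch grid) in a 64-dim space.
# Random data shows the mechanics only; real visual tokens carry far more
# redundancy, so stage 2 would merge far more aggressively in practice.
rng = np.random.default_rng(0)
vis = rng.normal(size=(576, 64))
prm = rng.normal(size=(64,))
out = reduce_tokens(vis, prm)
print(f"{len(vis)} tokens -> {len(out)} tokens")

The greedy average in stage 2 is a deliberate simplification: the record does not specify how the paper's token routing actually assigns or fuses tokens, only that it targets redundancy among the retained visual tokens.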