Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Bibliographic Details
Published in: arXiv.org (Mar 14, 2024), p. n/a
Main Author: Zhan, Yufei
Other Authors: Zhu, Yousong, Zhao, Hongyin, Yang, Fan, Tang, Ming, Wang, Jinqiao
Published:
Cornell University Library, arXiv.org
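Subjects: Language; Perception; Image resolution; Large language models; Constraint modelling; Object recognition; Free form; High resolution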
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2957595560
003 UK-CbPIL
022 |a 2331-8422 
035 |a 2957595560 
045 0 |b d20240314 
100 1 |a Zhan, Yufei 
245 1 |a Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring 
260 |b Cornell University Library, arXiv.org  |c Mar 14, 2024 
513 |a Working Paper 
520 3 |a Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete context and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any object of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon. 
653 |a Language 
653 |a Perception 
653 |a Image resolution 
653 |a Large language models 
653 |a Constraint modelling 
653 |a Object recognition 
653 |a Free form 
653 |a High resolution 
700 1 |a Zhu, Yousong 
700 1 |a Zhao, Hongyin 
700 1 |a Yang, Fan 
700 1 |a Tang, Ming 
700 1 |a Wang, Jinqiao 
773 0 |t arXiv.org  |g (Mar 14, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2957595560/abstract/embedded/J7RWLIQ9I3C9JK51?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2403.09333