Visual Lexicon: Rich Image Features in Language Space

Detailed bibliography
Published in: arXiv.org (Dec 9, 2024), p. n/a
Main author: Wang, XuDong
Other authors: Zhou, Xingyi, Fathi, Alireza, Darrell, Trevor, Schmid, Cordelia
Published: Cornell University Library, arXiv.org
Topics: Language, Self-supervised learning, Semantics, Vision, Image quality, Image reconstruction, Natural language processing, Scene analysis, Image processing, Natural language
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3142730777
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3142730777 
045 0 |b d20241209 
100 1 |a Wang, XuDong 
245 1 |a Visual Lexicon: Rich Image Features in Language Space 
260 |b Cornell University Library, arXiv.org  |c Dec 9, 2024 
513 |a Working Paper 
520 3 |a We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings--even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline. 
653 |a Language 
653 |a Self-supervised learning 
653 |a Semantics 
653 |a Vision 
653 |a Image quality 
653 |a Image reconstruction 
653 |a Natural language processing 
653 |a Scene analysis 
653 |a Image processing 
653 |a Natural language 
700 1 |a Zhou, Xingyi 
700 1 |a Fathi, Alireza 
700 1 |a Darrell, Trevor 
700 1 |a Schmid, Cordelia 
773 0 |t arXiv.org  |g (Dec 9, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3142730777/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.06774
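
The abstract in field 520 describes a self-supervised pipeline in which a trainable image encoder emits a small set of "ViLex" token embeddings living in the text space of a frozen text-to-image diffusion model, with the frozen model's denoising reconstruction loss as the only training signal. The following is a minimal, hypothetical PyTorch sketch of that idea; every module name, dimension, and the toy noise schedule is an illustrative assumption, not the authors' released implementation.

# Minimal sketch of a ViLex-style training step: a trainable encoder produces
# token embeddings in the text space of a frozen T2I denoiser, and is trained
# only through that frozen model's denoising loss on the input image.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_DIM = 768     # assumed width of the T2I model's text-embedding space
NUM_TOKENS = 8     # assumed number of ViLex tokens per image


class ViLexEncoder(nn.Module):
    """Maps an image to NUM_TOKENS embeddings in the text space (toy backbone)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(              # stand-in for a ViT/SigLIP encoder
            nn.Conv2d(3, 64, kernel_size=8, stride=8), nn.GELU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.to_tokens = nn.Linear(64 * 16, NUM_TOKENS * TEXT_DIM)

    def forward(self, images):                      # images: (B, 3, H, W)
        return self.to_tokens(self.backbone(images)).view(-1, NUM_TOKENS, TEXT_DIM)


class FrozenDenoiser(nn.Module):
    """Placeholder for the frozen T2I UNet; predicts noise given token conditioning."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(TEXT_DIM, 3)

    def forward(self, noisy_images, timesteps, cond_tokens):
        bias = self.cond_proj(cond_tokens.mean(dim=1))[:, :, None, None]
        return self.net(noisy_images) + bias


def training_step(encoder, frozen_denoiser, images):
    """One self-supervised step: reconstruct the input through the frozen T2I model."""
    tokens = encoder(images)                                  # ViLex "text" tokens
    noise = torch.randn_like(images)
    t = torch.randint(0, 1000, (images.size(0),))
    alpha = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)      # toy noise schedule
    noisy = alpha.sqrt() * images + (1.0 - alpha).sqrt() * noise
    pred = frozen_denoiser(noisy, t, tokens)                  # weights are frozen, but
    return F.mse_loss(pred, noise)                            # gradients reach the tokens


if __name__ == "__main__":
    encoder, denoiser = ViLexEncoder(), FrozenDenoiser()
    for p in denoiser.parameters():                           # keep the T2I model frozen
        p.requires_grad_(False)
    optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
    images = torch.randn(2, 3, 256, 256)                      # dummy batch
    loss = training_step(encoder, denoiser, images)
    loss.backward()
    optimizer.step()
    print(f"denoising loss: {loss.item():.4f}")

In a real setup the placeholder denoiser would be a pretrained latent-diffusion UNet, and the ViLex tokens could be injected alongside, or in place of, natural-language prompt embeddings, matching the combined visual-plus-text prompting the abstract describes.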