Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Bibliographic Details
Published in: arXiv.org (Jun 15, 2024), p. n/a
Main Author: Luo, Jiayun
Other Authors: Khandelwal, Siddhesh; Sigal, Leonid; Li, Boyang
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3028036359
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3028036359 
045 0 |b d20240615 
100 1 |a Luo, Jiayun 
245 1 |a Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models 
260 |b Cornell University Library, arXiv.org  |c Jun 15, 2024 
513 |a Working Paper 
520 3 |a From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS. 
653 |a Visual tasks 
653 |a Vision 
653 |a Semantic segmentation 
653 |a Neural networks 
653 |a Image segmentation 
653 |a Pascal (programming language) 
653 |a Semantics 
700 1 |a Khandelwal, Siddhesh 
700 1 |a Sigal, Leonid 
700 1 |a Li, Boyang 
773 0 |t arXiv.org  |g (Jun 15, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3028036359/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2311.17095