Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
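| Published in: | arXiv.org (Jun 15, 2024), p. n/a |
|---|---|
| Main author: | Luo, Jiayun |
| Other authors: | Khandelwal, Siddhesh; Sigal, Leonid; Li, Boyang |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Visual tasks; Vision; Semantic segmentation; Neural networks; Image segmentation; Pascal (programming language); Semantics |
| Online access: | Citation/Abstract; Full text outside of ProQuest |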
MARC
| Tag | Ind1 | Ind2 | Content |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3028036359 |
| 003 | | | UK-CbPIL |
| 022 | | | $a 2331-8422 |
| 035 | | | $a 3028036359 |
| 045 | 0 | | $b d20240615 |
| 100 | 1 | | $a Luo, Jiayun |
| 245 | 1 | | $a Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models |
| 260 | | | $b Cornell University Library, arXiv.org $c Jun 15, 2024 |
| 513 | | | $a Working Paper |
| 520 | 3 | | $a From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS. |
| 653 | | | $a Visual tasks |
| 653 | | | $a Vision |
| 653 | | | $a Semantic segmentation |
| 653 | | | $a Neural networks |
| 653 | | | $a Image segmentation |
| 653 | | | $a Pascal (programming language) |
| 653 | | | $a Semantics |
| 700 | 1 | | $a Khandelwal, Siddhesh |
| 700 | 1 | | $a Sigal, Leonid |
| 700 | 1 | | $a Li, Boyang |
| 773 | 0 | | $t arXiv.org $g (Jun 15, 2024), p. n/a |
| 786 | 0 | | $d ProQuest $t Engineering Database |
| 856 | 4 | 1 | $3 Citation/Abstract $u https://www.proquest.com/docview/3028036359/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch |
| 856 | 4 | 0 | $3 Full text outside of ProQuest $u http://arxiv.org/abs/2311.17095 |