Multi-Modal Prototypes for Open-World Semantic Segmentation

Saved in:
Bibliographic details
Published in: arXiv.org (Jul 11, 2024), p. n/a
Main author: Yang, Yuhuan
Other authors: Ma, Chaofan; Chen, Ju; Zhang, Fei; Yao, Jiangchao; Zhang, Ya; Wang, Yanfeng
Published:
Cornell University Library, arXiv.org
Subjects: Prototypes; Datasets; Cues; High level languages; Semantics; Semantic segmentation; Pascal (programming language); Ablation
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2833812687
003 UK-CbPIL
022 |a 2331-8422 
035 |a 2833812687 
045 0 |b d20240711 
100 1 |a Yang, Yuhuan 
245 1 |a Multi-Modal Prototypes for Open-World Semantic Segmentation 
260 |b Cornell University Library, arXiv.org  |c Jul 11, 2024 
513 |a Working Paper 
520 3 |a In semantic segmentation, generalizing a visual system to both seen and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely on either providing several support demonstrations from the visual aspect or characterizing informative clues from the textual aspect (e.g., the class names). Nevertheless, both lines of work neglect the complementary nature of low-level visual and high-level language information, and explorations that treat the visual and textual modalities as a whole to promote prediction remain limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes that provide more comprehensive support for open-world semantic segmentation, and we build a novel prototype-based segmentation framework to realize this promise. Specifically, unlike a straightforward combination of bi-modal clues, we decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes; on this basis, a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate for prediction. Built on an elastic mask prediction module that accepts any number and form of prototype inputs, our framework solves the zero-shot, few-shot, and generalized counterpart tasks in one architecture. Extensive experiments on the PASCAL-\(5^i\) and COCO-\(20^i\) datasets show the consistent superiority of the proposed method over previous state-of-the-art approaches, and a range of ablation studies thoroughly dissects each component of the framework, both quantitatively and qualitatively, verifying its effectiveness. 
653 |a Prototypes 
653 |a Datasets 
653 |a Cues 
653 |a High level languages 
653 |a Semantics 
653 |a Semantic segmentation 
653 |a Pascal (programming language) 
653 |a Ablation 
700 1 |a Ma, Chaofan 
700 1 |a Chen, Ju 
700 1 |a Zhang, Fei 
700 1 |a Yao, Jiangchao 
700 1 |a Zhang, Ya 
700 1 |a Wang, Yanfeng 
773 0 |t arXiv.org  |g (Jul 11, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2833812687/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2307.02003
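
Note: the following is a minimal, hypothetical Python (PyTorch) sketch of the prototype-based pipeline that the abstract (field 520) outlines: textual prototypes derived from class-name embeddings, a visual prototype aggregated from support features, a simple fusion of the two, and a mask prediction step that accepts any number of prototypes. The function names, tensor shapes, the averaging fusion rule, and the thresholding are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch only; stands in for the paper's multi-modal prototype idea.
import torch
import torch.nn.functional as F

def masked_average_pool(feats, masks):
    """Aggregate support features into one visual prototype.
    feats: (N, C, H, W) support features; masks: (N, H, W) binary masks."""
    masks = masks.unsqueeze(1).float()                        # (N, 1, H, W)
    protos = (feats * masks).sum(dim=(2, 3)) / masks.sum(dim=(2, 3)).clamp(min=1e-6)
    return protos.mean(dim=0)                                 # (C,)

def fuse_prototypes(text_protos, vis_proto):
    """Toy 'complementary fusion': average each textual prototype with the visual one.
    text_protos: (K, C) multi-aspect textual prototypes; vis_proto: (C,)."""
    return 0.5 * (F.normalize(text_protos, dim=-1) + F.normalize(vis_proto, dim=-1))

def predict_mask(query_feats, prototypes, tau=0.1):
    """Elastic matching: cosine similarity between query features and ANY number
    of prototypes, max-pooled over prototypes, then thresholded into a mask.
    query_feats: (C, H, W); prototypes: (K, C)."""
    q = F.normalize(query_feats.flatten(1), dim=0)            # (C, H*W)
    p = F.normalize(prototypes, dim=-1)                       # (K, C)
    sim = (p @ q).max(dim=0).values / tau                     # (H*W,)
    return sim.view(*query_feats.shape[1:]).sigmoid() > 0.5   # (H, W) boolean mask

# Usage with random tensors standing in for backbone / text-encoder outputs.
C, H, W = 256, 32, 32
support_feats = torch.randn(2, C, H, W)        # 2 support images (few-shot case)
support_masks = torch.rand(2, H, W) > 0.5      # binary support masks
text_protos = torch.randn(4, C)                # 4 textual "aspects" of the class name
vis_proto = masked_average_pool(support_feats, support_masks)
prototypes = fuse_prototypes(text_protos, vis_proto)
mask = predict_mask(torch.randn(C, H, W), prototypes)
print(mask.shape)  # torch.Size([32, 32])

Because predict_mask only sees a set of prototypes, the same routine covers the zero-shot case (textual prototypes only), the few-shot case (textual plus visual prototypes), and the generalized setting, which mirrors the "any number and form of prototype inputs" claim in the abstract.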