Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation

Bibliographic information
Published in: arXiv.org (Dec 3, 2024), p. n/a
Main author: Maharana, Sarthak Kumar
Other authors: Zhang, Baoming; Karlinsky, Leonid; Feris, Rogerio; Guo, Yunhui
Publisher: Cornell University Library, arXiv.org
Subjects: Adaptation; Feature extraction; Testing time; Image enhancement; Zero-shot learning; Corruption; Robustness
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3141255158
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3141255158 
045 0 |b d20241203 
100 1 |a Maharana, Sarthak Kumar 
245 1 |a Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation 
260 |b Cornell University Library, arXiv.org  |c Dec 3, 2024 
513 |a Working Paper 
520 3 |a Although open-vocabulary classification models like Contrastive Language-Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we find that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose a bimodal TTA framework specifically designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for better image feature extraction, but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art TTA results for CLIP, specifically for domains involving image corruption. In particular, with a ViT-B/16 vision backbone, we obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% on CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively. 
653 |a Adaptation 
653 |a Feature extraction 
653 |a Testing time 
653 |a Image enhancement 
653 |a Zero-shot learning 
653 |a Corruption 
653 |a Robustness 
700 1 |a Zhang, Baoming 
700 1 |a Karlinsky, Leonid 
700 1 |a Feris, Rogerio 
700 1 |a Guo, Yunhui 
773 0 |t arXiv.org  |g (Dec 3, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3141255158/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.02837
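
Note on the abstract (field 520): the method is described only at a high level there, namely adapting CLIP's visual encoder at test time and aligning pseudo-label-based image class prototypes with the corresponding class text features. Below is a minimal, illustrative PyTorch sketch of that prototype-text alignment idea, not the authors' actual implementation or objective; the function name prototype_alignment_loss, the temperature value, and the use of plain cross-entropy over prototype-text similarities are assumptions made for illustration only.

# Illustrative sketch of prototype-text alignment for CLIP test-time adaptation.
# NOT the paper's exact objective. Assumes PyTorch and that CLIP image/text
# features have already been extracted and L2-normalized.
import torch
import torch.nn.functional as F

def prototype_alignment_loss(image_feats, text_feats, temperature=0.01):
    """image_feats: (B, D) normalized features of an unlabeled test batch
                    from the (adapted) visual encoder.
       text_feats:  (C, D) normalized class text features
                    (e.g. from prompts like "a photo of a <class>").
       Returns a loss that pulls each pseudo-labeled class prototype toward
       its corresponding text feature."""
    # Zero-shot logits and pseudo-labels for the unlabeled test batch.
    logits = image_feats @ text_feats.t() / temperature          # (B, C)
    pseudo_labels = logits.argmax(dim=1)                          # (B,)

    losses = []
    for c in pseudo_labels.unique():
        # Class prototype: mean of image features pseudo-labeled as class c.
        proto = image_feats[pseudo_labels == c].mean(dim=0)
        proto = F.normalize(proto, dim=0)
        # Encourage the prototype to match the text feature of class c via
        # cross-entropy over prototype-text similarities.
        sim = proto @ text_feats.t() / temperature                # (C,)
        losses.append(F.cross_entropy(sim.unsqueeze(0), c.unsqueeze(0)))
    return torch.stack(losses).mean()

In use, a loss of this kind would be backpropagated into whichever visual-encoder parameters are chosen for adaptation (many TTA methods update only normalization-layer affine parameters); which parameters the paper actually updates is not stated in the abstract.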