Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Bibliographic Details
Published in: arXiv.org (Jul 26, 2024), p. n/a
Main Author: Zhang, Jie
Other Authors: Wang, Zhongqi; Lei, Mengqi; Zheng, Yuan; Yan, Bei; Shan, Shiguang; Chen, Xilin
Published:
Cornell University Library, arXiv.org
Subjects: Perception, Questions, Images, Free form, Benchmarks
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3073383424
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3073383424 
045 0 |b d20240726 
100 1 |a Zhang, Jie 
245 1 |a Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs 
260 |b Cornell University Library, arXiv.org  |c Jul 26, 2024 
513 |a Working Paper 
520 3 |a Currently, many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most benchmarks construct their questions by selecting images from existing datasets, which creates a risk of data leakage. In addition, these benchmarks evaluate LVLMs only on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose Dysca, a dynamic and scalable benchmark for evaluating LVLMs using synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 image styles and evaluate perception capability across 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choice, True-or-false and Free-form). Thanks to its generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at \url{https://github.com/Benchmark-Dysca/Dysca}. 
653 |a Perception 
653 |a Questions 
653 |a Images 
653 |a Free form 
653 |a Benchmarks 
700 1 |a Wang, Zhongqi 
700 1 |a Lei, Mengqi 
700 1 |a Zheng, Yuan 
700 1 |a Yan, Bei 
700 1 |a Shan, Shiguang 
700 1 |a Chen, Xilin 
773 0 |t arXiv.org  |g (Jul 26, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3073383424/abstract/embedded/J7RWLIQ9I3C9JK51?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2406.18849
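
To make the generation recipe described in the 520 abstract concrete, the following is a minimal, hypothetical sketch of a rule-based sample generator: attributes are sampled, combined into a text-to-image prompt (which would be handed to a model such as Stable Diffusion), and then reused to derive questions and ground-truth answers in the three question types. The attribute lists, templates, and function names are illustrative assumptions, not the benchmark's released code (see the arXiv and GitHub links above).

# Illustrative sketch only (not the authors' released code): a rule-based
# generator in the spirit of the 520 abstract. All attribute lists,
# templates, and names are hypothetical stand-ins.
import random

STYLES = ["oil painting", "pixel art", "watercolor"]   # stand-ins for the 51 styles
ANIMALS = ["cat", "dog", "fox", "owl"]
COLORS = ["red", "blue", "green"]

def make_sample(rng: random.Random) -> dict:
    """Sample attributes, build a synthesis prompt, and derive Q/A triples."""
    style = rng.choice(STYLES)
    animal = rng.choice(ANIMALS)
    color = rng.choice(COLORS)

    # The prompt would be passed to a text-to-image model such as Stable
    # Diffusion (e.g. via the diffusers library); image synthesis is omitted here.
    prompt = f"a {color} {animal}, {style} style"

    # Ground truth comes from the sampled attributes themselves, so no human
    # annotation is needed and fresh samples can be drawn on demand.
    distractors = rng.sample([a for a in ANIMALS if a != animal], 2)
    options = rng.sample([animal] + distractors, 3)    # shuffled answer options
    claim = rng.choice(ANIMALS)                        # statement to verify

    return {
        "prompt": prompt,
        "multi_choice": {
            "question": f"Which animal is in the image? Options: {', '.join(options)}.",
            "answer": animal,
        },
        "true_or_false": {
            "question": f"True or false: the animal in the image is a {claim}.",
            "answer": claim == animal,
        },
        "free_form": {
            "question": "Describe the animal shown in the image.",
            "answer": f"a {color} {animal}",
        },
    }

if __name__ == "__main__":
    print(make_sample(random.Random(0)))

Because the questions and answers are derived from the same sampled attributes used to build the prompt, the question pool can be regenerated at will, which is what the abstract presents as making such a benchmark dynamic and resistant to data leakage.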