VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 1, 2024), p. n/a
Author: Ren, Weiming
Other Authors: Yang, Huan, Min, Jie, Cong, Wei, Chen, Wenhu
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3139001185
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3139001185 
045 0 |b d20241201 
100 1 |a Ren, Weiming 
245 1 |a VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation 
260 |b Cornell University Library, arXiv.org  |c Dec 1, 2024 
513 |a Working Paper 
520 3 |a Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework. 
653 |a Datasets 
653 |a Data augmentation 
653 |a Video data 
653 |a Synthesis 
653 |a Spatiotemporal data 
653 |a Benchmarks 
653 |a High resolution 
653 |a Effectiveness 
700 1 |a Yang, Huan 
700 1 |a Min, Jie 
700 1 |a Cong, Wei 
700 1 |a Chen, Wenhu 
773 0 |t arXiv.org  |g (Dec 1, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3139001185/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.00927