VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Saved in:
| Published in: | arXiv.org (Dec 1, 2024), p. n/a |
|---|---|
| Author: | Ren, Weiming |
| Other authors: | Yang, Huan; Min, Jie; Cong, Wei; Chen, Wenhu |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Datasets; Data augmentation; Video data; Synthesis; Spatiotemporal data; Benchmarks; High resolution; Effectiveness |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Field | Ind1 | Ind2 | Data |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3139001185 |
| 003 | | | UK-CbPIL |
| 022 | | | $a 2331-8422 |
| 035 | | | $a 3139001185 |
| 045 | 0 | | $b d20241201 |
| 100 | 1 | | $a Ren, Weiming |
| 245 | 1 | | $a VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation |
| 260 | | | $b Cornell University Library, arXiv.org $c Dec 1, 2024 |
| 513 | | | $a Working Paper |
| 520 | 3 | | $a Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework. |
| 653 | | | $a Datasets |
| 653 | | | $a Data augmentation |
| 653 | | | $a Video data |
| 653 | | | $a Synthesis |
| 653 | | | $a Spatiotemporal data |
| 653 | | | $a Benchmarks |
| 653 | | | $a High resolution |
| 653 | | | $a Effectiveness |
| 700 | 1 | | $a Yang, Huan |
| 700 | 1 | | $a Min, Jie |
| 700 | 1 | | $a Cong, Wei |
| 700 | 1 | | $a Chen, Wenhu |
| 773 | 0 | | $t arXiv.org $g (Dec 1, 2024), p. n/a |
| 786 | 0 | | $d ProQuest $t Engineering Database |
| 856 | 4 | 1 | $3 Citation/Abstract $u https://www.proquest.com/docview/3139001185/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | $3 Full text outside of ProQuest $u http://arxiv.org/abs/2412.00927 |
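
For readers who want a concrete picture of the augmentation the 520 abstract describes, the sketch below illustrates the two basic operations it mentions: combining clips spatially into a higher-resolution mosaic and temporally into a longer sequence. This is a minimal illustration, not the authors' released code; the NumPy array representation, the function names, and the toy clip shapes are assumptions, and the question-answer generation step on the synthesized videos is not shown.

```python
import numpy as np


def spatial_combine(clip_a: np.ndarray, clip_b: np.ndarray) -> np.ndarray:
    """Tile two clips side by side, frame by frame, yielding a wider (higher-resolution) clip.

    Both clips are (T, H, W, C) arrays; the longer one is truncated so frame counts match.
    """
    t = min(len(clip_a), len(clip_b))
    return np.concatenate([clip_a[:t], clip_b[:t]], axis=2)  # join along the width axis


def temporal_combine(clip_a: np.ndarray, clip_b: np.ndarray) -> np.ndarray:
    """Concatenate two clips along the time axis, yielding a longer-duration clip."""
    return np.concatenate([clip_a, clip_b], axis=0)


if __name__ == "__main__":
    # Two toy 32-frame, 224x224 clips (random pixels stand in for real video frames).
    a = np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8)
    b = np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8)

    wide = spatial_combine(a, b)   # (32, 224, 448, 3): enhanced spatial resolution
    long = temporal_combine(a, b)  # (64, 224, 224, 3): extended duration
    print(wide.shape, long.shape)
```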