T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Bibliographic Details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main Author: Yin, Shukang
Other Authors: Fu, Chaoyou; Zhao, Sirui; Shen, Yunhang; Ge, Chunjiang; Yang, Yan; Long, Zuwei; Dai, Yuhan; Xu, Tong; Sun, Xing; He, Ran; Shan, Caifeng; Chen, Enhong
Published:
Cornell University Library, arXiv.org
Subjects: Translating; Data augmentation; Video data; Images; Large language models; Training; Inference
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3138994644
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3138994644 
045 0 |b d20241202 
100 1 |a Yin, Shukang 
245 1 |a T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending that success to the video understanding realm. Apart from training from scratch, an efficient way is to utilize pre-trained image-LLMs, leading to two mainstream approaches, i.e., zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches yields an effective data augmentation method. We first make a deeper inspection of the zero-shot inference approach and identify two limitations, i.e., limited generalization and a lack of temporal understanding capabilities. We then investigate the fine-tuning approach and find low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples that enrich the instruction diversity of the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets while training with just 15% of the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training with long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and the curation of high-quality data. The code is released at https://github.com/xjtupanda/T2Vid. 
653 |a Translating 
653 |a Data augmentation 
653 |a Video data 
653 |a Images 
653 |a Large language models 
653 |a Training 
653 |a Inference 
700 1 |a Fu, Chaoyou 
700 1 |a Zhao, Sirui 
700 1 |a Shen, Yunhang 
700 1 |a Ge, Chunjiang 
700 1 |a Yang, Yan 
700 1 |a Long, Zuwei 
700 1 |a Dai, Yuhan 
700 1 |a Xu, Tong 
700 1 |a Sun, Xing 
700 1 |a He, Ran 
700 1 |a Shan, Caifeng 
700 1 |a Chen, Enhong 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3138994644/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.19951