VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

Bibliographic Details
Published in:	Applied Sciences vol. 14, no. 5 (2024), p. 1894
Main Author:	Xu, Yifang
Other Authors:	Sun, Yunzhuo, Xie, Zien, Zhai, Benxiang, Du, Sidan
Published:	MDPI AG
Subjects:	Language Design Methods Linguistics Annotations Queries Proposals Natural language Bias Prejudice
Online Access:	Citation/Abstract Full Text + Graphics Full Text - PDF

Description
Abstract:	Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves competitive performance comparable to supervised methods. The code is available on GitHub.
ISSN:	2076-3417
DOI:	10.3390/app14051894
Source:	Publicly Available Content Database