Multimodal Learning from Videos: Self-Supervised Pre-Training, Post-Training Alignment, and Benchmarks
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main author: | |
| Other authors: | , , , , |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Online access: | Citation/Abstract; Full Text - PDF |
| Abstract: | The basis of human intelligence is inherently multimodal, relying primarily on vision, language, and audio for perception, learning, and interaction. Consequently, the emergence of human-level artificial intelligence (AI) hinges on its ability to perceive and learn from these fundamental modalities at scale. This thesis investigates multimodal learning from videos, an abundant source of visual, auditory, and linguistic signals that capture real-world dynamics, intent, and physical phenomena. The thesis spans from self-supervised pre-training for learning general and abstract representations to post-training alignment with subtle human preferences. It further contributes to the design of benchmarks to foster progress in multimodal video understanding and reasoning. The first part of this thesis focuses on self-supervised learning from videos. It presents the first systematic study of self-supervised video methods under real-world distribution shifts, revealing novel insights and failure modes that can inform future algorithm design. In addition, it introduces multimodal self-supervised learning methods that leverage cross-modal relationships between audio and visual streams to learn more effective and robust representations for diverse audio-visual tasks. The second part of this thesis focuses on post-training alignment of multimodal models to refine their capabilities, such as aligning with human preferences, improving reasoning, and reducing hallucinations. In particular, it investigates methods for achieving fine-grained alignment with human intent and expectations. Moreover, it places special emphasis on preserving the models’ broad general capabilities after alignment, addressing a common limitation of existing approaches. Finally, this thesis highlights the importance of developing multimodal video benchmarks to meet emerging real-world challenges. It presents the first multimodal video dataset that captures both cognitive load and emotional states, enabling the development of AI systems that better support human well-being. Furthermore, it addresses the lack of benchmarks for assessing the multi-step causal reasoning abilities of multimodal models directly from visual observations, which is essential for their application in complex real-world decision-making scenarios. It introduces the first benchmark specifically designed for evaluating long-form causal reasoning from videos. Collectively, these contributions pave the way for human-centric multimodal AI systems with improved capabilities and safe, reliable use. |
|---|---|
| ISBN: | 9798270208233 |
| Source: | ProQuest Dissertations & Theses Global |