Multimodal Learning from Videos: Self-Supervised Pre-Training, Post-Training Alignment, and Benchmarks

Bibliographic details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Sarkar, Pritam
Other authors: Posen, Aaron; Beirami, Ahmad; Ebrahimi, Sayna; Arık, Sercan; Pfister, Tomas
Published:
ProQuest Dissertations & Theses
Subjects:
Links: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3283374102
003 UK-CbPIL
020 |a 9798270208233 
035 |a 3283374102 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Sarkar, Pritam 
245 1 |a Multimodal Learning from Videos: Self-Supervised Pre-Training, Post-Training Alignment, and Benchmarks 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a The basis of human intelligence is inherently multimodal, relying primarily on vision, language, and audio for perception, learning, and interaction. Consequently, the emergence of human-level artificial intelligence (AI) hinges on its ability to perceive and learn from these fundamental modalities at scale. This thesis investigates multimodal learning from videos, an abundant source of visual, auditory, and linguistic signals that capture real-world dynamics, intent, and physical phenomena. This thesis spans from self-supervised pre-training for learning general and abstract representations, to post-training alignment with subtle human preferences. It further contributes to the design of benchmarks to foster progress in multimodal video understanding and reasoning. The first part of this thesis focuses on self-supervised learning from videos. It presents the first systematic study of video self-supervised methods under real-world distribution shifts, revealing novel insights and failure modes that can inform future algorithm design. In addition, it introduces multimodal self-supervised learning methods that leverage cross-modal relationships between audio and visual streams to learn more effective and robust representations for diverse audio-visual tasks. The second part of this thesis focuses on post-training alignment of multimodal models to refine their capabilities, such as aligning with human preferences, improving reasoning, and reducing hallucinations. In particular, it investigates methods for achieving fine-grained alignment with human intent and expectations. Moreover, it places special emphasis on preserving the models’ broad general capabilities after alignment, addressing a common limitation of existing approaches. Finally, this thesis highlights the importance of developing multimodal video benchmarks to meet emerging real-world challenges. It presents the first multimodal video dataset that captures both cognitive load and emotional states, enabling the development of AI systems that better support human well-being. Furthermore, this thesis addresses the lack of benchmarks for assessing the multi-step causal reasoning abilities of multimodal models directly from visual observations, which is essential for their application in complex real-world decision-making scenarios. It introduces the first benchmark specifically designed for evaluating long-form causal reasoning from videos. Collectively, these contributions pave the way for human-centric multimodal AI systems with improved capabilities and safe, reliable use. 
653 |a Behavior 
653 |a Datasets 
653 |a Video recordings 
653 |a Optimization 
653 |a Work at home 
653 |a Cognitive load 
653 |a Cognition & reasoning 
653 |a Robotics 
653 |a COVID-19 
653 |a Experiments 
653 |a Knowledge 
653 |a Neural networks 
653 |a Pandemics 
653 |a Multiple choice 
653 |a Benchmarks 
653 |a Preferences 
653 |a Supervision 
653 |a Design 
653 |a Methods 
653 |a Linguistics 
653 |a Mental health 
653 |a Annotations 
653 |a Large language models 
653 |a Learning 
653 |a Artificial intelligence 
653 |a Cognitive psychology 
653 |a Epidemiology 
653 |a Film studies 
653 |a Management 
700 1 |a Posen, Aaron 
700 1 |a Beirami, Ahmad 
700 1 |a Ebrahimi, Sayna 
700 1 |a Arık, Sercan 
700 1 |a Pfister, Tomas 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3283374102/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3283374102/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch