Multimodal Learning from Videos: Self-Supervised Pre-Training, Post-Training Alignment, and Benchmarks
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main author: | Sarkar, Pritam |
| Other authors: | Posen, Aaron; Beirami, Ahmad; Ebrahimi, Sayna; Arık, Sercan; Pfister, Tomas |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Links: | Citation/Abstract Full Text - PDF |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 3283374102 | ||
| 003 | UK-CbPIL | ||
| 020 | |a 9798270208233 | ||
| 035 | |a 3283374102 | ||
| 045 | 2 | |b d20250101 |b d20251231 | |
| 084 | |a 66569 |2 nlm | ||
| 100 | 1 | |a Sarkar, Pritam | |
| 245 | 1 | |a Multimodal Learning from Videos: Self-Supervised Pre-Training, Post-Training Alignment, and Benchmarks | |
| 260 | |b ProQuest Dissertations & Theses |c 2025 | ||
| 513 | |a Dissertation/Thesis | ||
| 520 | 3 | |a The basis of human intelligence is inherently multimodal, relying primarily on vision, language, and audio for perception, learning, and interaction. Consequently, the emergence of human-level artificial intelligence (AI) hinges on its ability to perceive and learn from these fundamental modalities at scale. This thesis investigates multimodal learning from videos, an abundant source of visual, auditory, and linguistic signals that capture real-world dynamics, intent, and physical phenomena. It spans from self-supervised pre-training for learning general and abstract representations to post-training alignment with subtle human preferences. It further contributes to the design of benchmarks that foster progress in multimodal video understanding and reasoning. The first part of this thesis focuses on self-supervised learning from videos. It presents the first systematic study of video self-supervised methods under real-world distribution shifts, revealing novel insights and failure modes that can inform future algorithm design. In addition, it introduces multimodal self-supervised learning methods that leverage cross-modal relationships between audio and visual streams to learn more effective and robust representations for diverse audio-visual tasks. The second part of this thesis focuses on post-training alignment of multimodal models to refine their capabilities, such as aligning with human preferences, improving reasoning, and reducing hallucinations. In particular, it investigates methods for achieving fine-grained alignment with human intent and expectations. Moreover, it places special emphasis on preserving the models’ broad general capabilities after alignment, addressing a common limitation of existing approaches. Finally, this thesis highlights the importance of developing multimodal video benchmarks to meet emerging real-world challenges. It presents the first multimodal video dataset that captures both cognitive load and emotional states, enabling the development of AI systems that better support human well-being. Furthermore, this thesis addresses the lack of benchmarks for assessing the multi-step causal reasoning abilities of multimodal models directly from visual observations, which is essential for their application in complex real-world decision-making scenarios. It introduces the first benchmark specifically designed for evaluating long-form causal reasoning from videos. Collectively, these contributions pave the way for human-centric multimodal AI systems with improved capabilities and safe, reliable use. | |
| 653 | |a Behavior | ||
| 653 | |a Datasets | ||
| 653 | |a Video recordings | ||
| 653 | |a Optimization | ||
| 653 | |a Work at home | ||
| 653 | |a Cognitive load | ||
| 653 | |a Cognition & reasoning | ||
| 653 | |a Robotics | ||
| 653 | |a COVID-19 | ||
| 653 | |a Experiments | ||
| 653 | |a Knowledge | ||
| 653 | |a Neural networks | ||
| 653 | |a Pandemics | ||
| 653 | |a Multiple choice | ||
| 653 | |a Benchmarks | ||
| 653 | |a Preferences | ||
| 653 | |a Supervision | ||
| 653 | |a Design | ||
| 653 | |a Methods | ||
| 653 | |a Linguistics | ||
| 653 | |a Mental health | ||
| 653 | |a Annotations | ||
| 653 | |a Large language models | ||
| 653 | |a Learning | ||
| 653 | |a Artificial intelligence | ||
| 653 | |a Cognitive psychology | ||
| 653 | |a Epidemiology | ||
| 653 | |a Film studies | ||
| 653 | |a Management | ||
| 700 | 1 | |a Posen, Aaron | |
| 700 | 1 | |a Beirami, Ahmad | |
| 700 | 1 | |a Ebrahimi, Sayna | |
| 700 | 1 | |a Arık, Sercan | |
| 700 | 1 | |a Pfister, Tomas | |
| 773 | 0 | |t ProQuest Dissertations and Theses |g (2025) | |
| 786 | 0 | |d ProQuest |t ProQuest Dissertations & Theses Global | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3283374102/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text - PDF |u https://www.proquest.com/docview/3283374102/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch |