Perceiving and Simulating Human-World Interactions for Egocentric Agents

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Pan, Boxiao
Published: ProQuest Dissertations & Theses
Description
Abstract: The research in this dissertation is motivated by the challenge of building computational systems that can perceive, understand, and interact with the world from a first-person, or egocentric, perspective. The central premise of this work is that the egocentric viewpoint is fundamental to creating technologies that can collaborate effectively and safely with people in their daily environments, from augmented reality assistants to mobile robots.

A principal impediment to progress in this domain is the scarcity of large-scale, diverse datasets that provide the supervision needed to train robust models. Specifically, there is a need for data that captures rich, first-person sensory inputs together with their corresponding three-dimensional world states and actions. The difficulty of acquiring and annotating such data at scale motivates the primary technical challenges that this thesis addresses.

To overcome this data scarcity problem, the research presented here is structured around the paradigm of a "perception-simulation loop". This framework treats perception and simulation as complementary processes, where each can be used to improve the other. The contributions of this dissertation are therefore organized into two main parts, each investigating a different arc of this loop.

The first part of the thesis focuses on perception, investigating methods that learn directly from egocentric visual data. The work on COPILOT explores the use of large-scale synthetic data for near-term collision prediction, while the work on LookOut leverages targeted real-world data collection to model longer-term navigational intent in dynamic environments. The second part of the thesis shifts to simulation and modeling, exploring the use of strong priors to generate plausible human-world interactions. Here, MultiPhys demonstrates a physics-based approach to refining multi-person motion estimates, while ActAnywhere introduces a data-driven approach, using a generative model trained on large-scale video to synthesize semantically coherent scenes.

Collectively, these projects demonstrate a multi-faceted strategy for mitigating the data scarcity problem in egocentric perception and simulation. The thesis concludes with a summary of these contributions and a discussion of promising future research directions. I hope that the methods and insights presented here will contribute to the development of more powerful, robust, safe, and intuitive interactive systems.
ISBN: 9798265427540
Source: ProQuest Dissertations & Theses Global