Towards Multi-Modal Interactive Systems That Connect Audio, Vision and Beyond

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Liu, Xiulong
Published: ProQuest Dissertations & Theses
Online Access: Citation/Abstract
Full Text - PDF
Description
Abstract: Human perception relies on the integration of multi-sensory signals such as vision, audio, and language, enabling complex tasks such as interpreting environments and making decisions. In Artificial Intelligence (AI), multi-modal learning seeks to enable such abilities. Significant progress has been made in text-centered multi-modal learning, e.g., vision-language learning, while the connections between vision and audio remain less well characterized. Unlocking these relationships can enable AI systems to generate realistic audio, enhance contextual understanding for the execution of complex tasks, and create interactive interfaces, supporting applications in augmented reality, robotics, and virtual experiences.

Audio-visual learning is challenging due to the high-dimensional and redundant nature of both vision and audio signals. These signals are continuous and contain diverse information that is not necessarily organized. Existing text-centered multi-modal learning methods do not generalize to audio-visual scenarios because they rely on the strong semantic properties of the text modality. In particular, when these methods are adapted directly to audio-visual tasks, they often fail to capture the meaningful cross-modal interactions that are important for downstream tasks. It is therefore important to develop multi-modal systems specific to audio-visual scenarios that handle redundancy effectively. These systems should be capable of solving a range of tasks, from audio generation to complex multi-modal understanding problems such as question answering and navigation.

To address these challenges, in my PhD research I developed multi-modal interactive systems of two types: multi-modal audio generation systems and multi-modal task-solving systems. Audio generation systems aim to create audio that is as realistic as possible from visual and other modality inputs by modeling temporal and semantic correspondences. Task-solving systems process audio-visual and other multi-modal inputs to support tasks such as retrieval, question answering, and navigation, extracting task-relevant cues through multi-modal interactions.
ISBN: 9798288835711
Source: ProQuest Dissertations & Theses Global