Towards Multi-Modal Interactive Systems That Connect Audio, Vision and Beyond

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Liu, Xiulong
Publication: ProQuest Dissertations & Theses
Online access: Citation/Abstract; Full Text - PDF
Description
Abstract: Human perception relies on the integration of multi-sensory signals such as vision, audio, and language, enabling complex tasks such as interpreting environments and making decisions. In Artificial Intelligence (AI), multi-modal learning seeks to enable such abilities. Significant progress has been made in text-centered multi-modal learning, e.g., vision-language learning, while the connections between vision and audio remain less well characterized. Unlocking these relationships can enable AI systems to generate realistic audio, enhance contextual understanding for executing complex tasks, and create interactive interfaces. Such capabilities can support applications in augmented reality, robotics, and virtual experiences.

Audio-visual learning is challenging due to the high-dimensional and redundant nature of both vision and audio signals. The signals are continuous and carry diverse information that is not necessarily well organized. Existing text-centered multi-modal learning methods do not generalize to audio-visual scenarios because they rely on the strong semantic structure of the text modality. In particular, when these methods are directly adapted to audio-visual tasks, they often fail to capture the meaningful cross-modal interactions that matter for downstream tasks. It is therefore important to develop multi-modal systems tailored to audio-visual scenarios that handle this redundancy effectively. These systems should be capable of solving a range of tasks, from audio generation to complex multi-modal understanding problems such as question answering and navigation.

To address these challenges, in my PhD research I developed multi-modal interactive systems of two types: multi-modal audio generation systems and multi-modal task-solving systems. Audio generation systems aim to synthesize audio that is as realistic as possible from visual and other modality inputs by modeling temporal and semantic correspondences. Task-solving systems process audio-visual and other multi-modal inputs to support tasks such as retrieval, question answering, and navigation, using multi-modal interactions to extract task-relevant cues.
ISBN:9798288835711
Source: ProQuest Dissertations & Theses Global