Towards Multi-Modal Interactive Systems That Connect Audio, Vision and Beyond

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Liu, Xiulong
Published:
ProQuest Dissertations & Theses
Subjects: Computer science; Computer engineering; Artificial intelligence; Electrical engineering
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3230296010
003 UK-CbPIL
020 |a 9798288835711 
035 |a 3230296010 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Liu, Xiulong 
245 1 |a Towards Multi-Modal Interactive Systems That Connect Audio, Vision and Beyond 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a Human perception relies on the integration of multi-sensory signals such as vision, audio, and language, enabling complex tasks such as interpreting environments and making decisions. In Artificial Intelligence (AI), multi-modal learning seeks to enable such abilities. Significant progress has been made in text-centered multi-modal learning, e.g., vision-language learning, while the connections between vision and audio remain less well characterized. Unlocking these relationships can enable AI systems to generate realistic audio, enhance contextual understanding for the execution of complex tasks, and create interactive interfaces, supporting applications in augmented reality, robotics, and virtual experiences. Audio-visual learning is challenging due to the high-dimensional and redundant nature of both vision and audio signals. These signals are continuous and carry diverse information that is not necessarily organized. Existing text-centered multi-modal learning methods do not generalize to audio-visual scenarios because they rely on the strong semantic properties of the text modality; when directly adapted to audio-visual tasks, they often fail to capture the meaningful cross-modal interactions that downstream tasks depend on. It is therefore important to develop multi-modal systems designed specifically for audio-visual scenarios that handle this redundancy effectively. Such systems should be capable of solving a range of tasks, from audio generation to complex multi-modal understanding problems such as question answering and navigation. To address these challenges, my PhD research developed multi-modal interactive systems of two types: multi-modal audio generation systems and multi-modal task solving systems. Audio generation systems aim to create audio that is as realistic as possible from visual and other modality inputs by modeling temporal and semantic correspondences. Task solving systems process audio-visual and other multi-modal inputs to support tasks such as retrieval, question answering, and navigation, using multi-modal interactions to extract task-relevant cues. 
653 |a Computer science 
653 |a Computer engineering 
653 |a Artificial intelligence 
653 |a Electrical engineering 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3230296010/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3230296010/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch