InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Bibliographic details
Published in: arXiv.org (Dec 12, 2024), p. n/a
Main author: Zhang, Pan
Other authors: Dong, Xiaoyi; Cao, Yuhang; Zang, Yuhang; Qian, Rui; Wei, Xilin; Chen, Lin; Li, Yifei; Niu, Junbo; Ding, Shuangrui; Guo, Qipeng; Duan, Haodong; Chen, Xin; Lv, Han; Nie, Zheng; Zhang, Min; Wang, Bin; Zhang, Wenwei; Zhang, Xinyue; Ge, Jiaye; Li, Wei; Li, Jingwen; Tu, Zhongying; He, Conghui; Zhang, Xingcheng; Chen, Kai; Qiao, Yu; Lin, Dahua; Wang, Jiaqi
Published: Cornell University Library, arXiv.org
Subjects: Perception; Memory tasks; Modules; Audio data; Large language models; Real time; Cognitive tasks; Cognition & reasoning; Reasoning; Cognition
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3144196465
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3144196465 
045 0 |b d20241212 
100 1 |a Zhang, Pan 
245 1 |a InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions 
260 |b Cornell University Library, arXiv.org  |c Dec 12, 2024 
513 |a Working Paper 
520 3 |a Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time. 
653 |a Perception 
653 |a Memory tasks 
653 |a Modules 
653 |a Audio data 
653 |a Large language models 
653 |a Real time 
653 |a Cognitive tasks 
653 |a Cognition & reasoning 
653 |a Reasoning 
653 |a Cognition 
700 1 |a Dong, Xiaoyi 
700 1 |a Cao, Yuhang 
700 1 |a Zang, Yuhang 
700 1 |a Qian, Rui 
700 1 |a Wei, Xilin 
700 1 |a Chen, Lin 
700 1 |a Li, Yifei 
700 1 |a Niu, Junbo 
700 1 |a Ding, Shuangrui 
700 1 |a Guo, Qipeng 
700 1 |a Duan, Haodong 
700 1 |a Chen, Xin 
700 1 |a Lv, Han 
700 1 |a Nie, Zheng 
700 1 |a Zhang, Min 
700 1 |a Wang, Bin 
700 1 |a Zhang, Wenwei 
700 1 |a Zhang, Xinyue 
700 1 |a Ge, Jiaye 
700 1 |a Li, Wei 
700 1 |a Li, Jingwen 
700 1 |a Tu, Zhongying 
700 1 |a He, Conghui 
700 1 |a Zhang, Xingcheng 
700 1 |a Chen, Kai 
700 1 |a Qiao, Yu 
700 1 |a Lin, Dahua 
700 1 |a Wang, Jiaqi 
773 0 |t arXiv.org  |g (Dec 12, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3144196465/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.09596
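
The 520 abstract above describes an architecture of three disentangled modules (streaming perception, multi-modal long memory, reasoning) running over continuous video and audio input. Below is a minimal, hypothetical Python sketch of that coordination pattern only; every name in it (PerceptionModule, LongMemoryModule, ReasoningModule, run_stream) is an illustrative assumption and does not reflect the actual IXC2.5-OL codebase or API.

# A minimal, hypothetical sketch of the three-module design summarized in the
# 520 abstract: disentangled streaming perception, long memory, and reasoning.
# All class and method names are illustrative assumptions, not the IXC2.5-OL API.

import queue


class PerceptionModule:
    """Processes streaming chunks in real time and extracts key details."""

    def process(self, chunk):
        # A real system would run video/audio encoders here.
        return {"detail": f"features({chunk})"}


class LongMemoryModule:
    """Keeps short-term entries and compresses them into long-term memory."""

    def __init__(self, window=8):
        self.window = window
        self.short_term = []
        self.long_term = []

    def store(self, entry):
        self.short_term.append(entry)
        if len(self.short_term) >= self.window:
            # Compress a full short-term window into one long-term record.
            self.long_term.append({"compressed": list(self.short_term)})
            self.short_term.clear()

    def retrieve(self, query):
        # Stand-in for learned retrieval: return all stored memories.
        return self.short_term + self.long_term


class ReasoningModule:
    """Answers user queries against memories retrieved for the query."""

    def answer(self, query, memories):
        return f"answer({query!r}) grounded in {len(memories)} memory entries"


def run_stream(chunks, queries):
    perception = PerceptionModule()
    memory = LongMemoryModule()
    reasoning = ReasoningModule()
    pending = queue.Queue()
    for q in queries:
        pending.put(q)

    # Perception and memory run continuously on every chunk; reasoning is only
    # triggered by queries, so the system keeps perceiving while it "thinks".
    for chunk in chunks:
        memory.store(perception.process(chunk))
        while not pending.empty():
            query = pending.get()
            print(reasoning.answer(query, memory.retrieve(query)))


if __name__ == "__main__":
    run_stream(chunks=range(20), queries=["what happened at the start?"])

The point of the sketch is the control flow alone: perception and memory run on every incoming chunk while reasoning fires on demand, mirroring the abstract's claim that a single sequence-to-sequence model cannot perceive and respond simultaneously.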