Large Language Model-Brained GUI Agents: A Survey

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 23, 2024), p. n/a
Main Author: Zhang, Chaoyun
Other Authors: He, Shilin; Qian, Jiaxu; Li, Bowen; Li, Liqun; Qin, Si; Kang, Yu; Ma, Minghua; Liu, Guyue; Lin, Qingwei; Rajmohan, Saravan; Zhang, Dongmei; Zhang, Qi
Published:
Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3133825897
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3133825897 
045 0 |b d20241223 
100 1 |a Zhang, Chaoyun 
245 1 |a Large Language Model-Brained GUI Agents: A Survey 
260 |b Cornell University Library, arXiv.org  |c Dec 23, 2024 
513 |a Working Paper 
520 3 |a GUIs have long been central to human-computer interaction, providing an intuitive and visually driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. These models have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing, paving the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address key research questions, including existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents. 
653 |a Digital systems 
653 |a User experience 
653 |a Large language models 
653 |a Automation 
653 |a Digital computers 
653 |a Applications programs 
653 |a Natural language processing 
653 |a Task complexity 
653 |a Speech recognition 
653 |a Natural language 
653 |a Mobile computing 
700 1 |a He, Shilin 
700 1 |a Qian, Jiaxu 
700 1 |a Li, Bowen 
700 1 |a Li, Liqun 
700 1 |a Qin, Si 
700 1 |a Kang, Yu 
700 1 |a Ma, Minghua 
700 1 |a Liu, Guyue 
700 1 |a Lin, Qingwei 
700 1 |a Rajmohan, Saravan 
700 1 |a Zhang, Dongmei 
700 1 |a Zhang, Qi 
773 0 |t arXiv.org  |g (Dec 23, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3133825897/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.18279