CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Bibliographic Details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main Author: Arai, Hidehisa
Other Authors: Miwa, Keita, Sasaki, Kento, Yamaguchi, Yu, Watanabe, Kohei, Aoki, Shunsuke, Yamamoto, Issei
Published:
Cornell University Library, arXiv.org
Subjects: Language, Autonomous cars, Datasets, Data processing, Autonomous navigation, Annotations, Large language models, Natural language processing, Speech recognition
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3095284977
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3095284977 
045 0 |b d20241202 
100 1 |a Arai, Hidehisa 
245 1 |a CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purposes. 
653 |a Language 
653 |a Autonomous cars 
653 |a Datasets 
653 |a Data processing 
653 |a Autonomous navigation 
653 |a Annotations 
653 |a Large language models 
653 |a Natural language processing 
653 |a Speech recognition 
700 1 |a Miwa, Keita 
700 1 |a Sasaki, Kento 
700 1 |a Yamaguchi, Yu 
700 1 |a Watanabe, Kohei 
700 1 |a Aoki, Shunsuke 
700 1 |a Yamamoto, Issei 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3095284977/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2408.10845