AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Abstract

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
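
For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage, in which each sampled output's reward is normalized against the other outputs sampled for the same input. It is not AutoVLA's implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions chosen for illustration, and the abstract does not specify the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per output sampled for the same input.
    Population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled for one driving scene.
rewards = [0.9, 0.7, 0.4, 0.8]
advantages = group_relative_advantages(rewards)
```

Outputs that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are discouraged, so no separate learned value model is needed.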
