AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Abstract

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome these limitations, we introduce \textbf{AgentThink}, a unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: \textbf{(i) Structured Data Generation}, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data that explicitly incorporates tool usage for diverse driving scenarios; \textbf{(ii) A Two-stage Training Pipeline}, which employs Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and \textbf{(iii) Agent-style Tool-Usage Evaluation}, which introduces a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink boosts overall reasoning scores by \textbf{53.91\%} and answer accuracy by \textbf{33.54\%}, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and zero-shot/few-shot generalization experiments across various benchmarks underscore its capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.
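
The abstract names GRPO as the reinforcement stage of the training pipeline but does not describe the reward design. As a rough illustration only, the sketch below shows the group-relative advantage normalization that gives GRPO its name: several responses are sampled for one prompt, and each response's reward is standardized against the mean and standard deviation of its own group. The function name, the reward values, the `eps` term, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; actual implementations may differ on both points.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one driving question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Responses above the group mean get positive advantages, those below get
# negative ones, and the advantages sum to (approximately) zero.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the baseline is the group's own mean, this normalization needs no separately learned value function; a response is reinforced only to the extent that it beats its sibling samples for the same prompt.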
