DeepEyesV2: Toward Agentic Multimodal Model

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
