DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Abstract

The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution and reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
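
The abstract names two ingredients, group-normalized query rewards and KL regularization of the objectness distribution, without giving formulas. The following is only an illustrative sketch of what those two ideas could look like, not the authors' implementation. It assumes the GRPO-style normalization (subtract the group mean, divide by the group standard deviation) and treats each query's objectness as a Bernoulli probability; the function names, the example scores, and the direction of the KL term are all hypothetical choices made for illustration.

```python
import math


def group_normalized_rewards(alignment_scores, eps=1e-8):
    """Normalize per-query scores against the group mean and std (GRPO-style).

    `alignment_scores` is an assumed list of per-query alignment qualities;
    how the paper actually measures alignment is not stated in the abstract.
    """
    n = len(alignment_scores)
    mean = sum(alignment_scores) / n
    var = sum((s - mean) ** 2 for s in alignment_scores) / n
    std = math.sqrt(var)
    return [(s - mean) / (std + eps) for s in alignment_scores]


def bernoulli_kl(p, q, eps=1e-8):
    """KL(p || q) between two Bernoulli objectness probabilities."""
    # Clamp to avoid log(0) at the extremes.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def objectness_kl_penalty(current_probs, reference_probs):
    """Mean per-query KL between current and reference objectness (assumed form)."""
    pairs = list(zip(current_probs, reference_probs))
    return sum(bernoulli_kl(p, q) for p, q in pairs) / len(pairs)


# Hypothetical alignment scores for a group of four queries.
scores = [0.9, 0.5, 0.1, 0.5]
advantages = group_normalized_rewards(scores)

# Hypothetical objectness probabilities from the current and a reference model.
penalty = objectness_kl_penalty([0.8, 0.3], [0.7, 0.4])
```

In this sketch, queries scoring above the group average receive positive relative rewards and those below receive negative ones, while the KL term grows as the current objectness probabilities move away from the reference.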
