VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

Abstract

Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs responding to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
