ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model

Abstract

The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough toward developing more autonomous and capable VLMs.
