Abstract
The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough toward developing more autonomous and capable VLMs.