Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.