R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Abstract

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an $8.4\%$ improvement on the VL Reward-Bench and a $14.3\%$ improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
