SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Abstract

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers with those of responses leading to incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
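The reward shaping described above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the function names, the exponential form of the trust weight, the exponential annealing schedule, and the parameters `alpha` and `decay_steps` are hypothetical and are not taken from the paper or its repository. The sketch only shows the general idea: within a group of sampled responses, the thinking reward is down-weighted when responses with incorrect answers receive higher thinking rewards than responses with correct answers, and its contribution decays as training progresses.

```python
import math
from statistics import mean


def trust_weight(think_rewards, correct):
    """Hypothetical trustworthiness weight for one group of responses.

    Assumption: the weight is 1.0 when correct responses get a mean
    thinking reward at least as high as incorrect ones, and shrinks
    exponentially with the gap otherwise.
    """
    good = [r for r, c in zip(think_rewards, correct) if c]
    bad = [r for r, c in zip(think_rewards, correct) if not c]
    if not good or not bad:
        # No comparison is possible, so leave the thinking reward unscaled.
        return 1.0
    gap = mean(good) - mean(bad)
    return 1.0 if gap >= 0 else math.exp(gap)


def annealing_factor(step, decay_steps):
    """Hypothetical schedule: exponential decay of the thinking reward."""
    return math.exp(-step / decay_steps)


def combined_rewards(outcome_rewards, think_rewards, step,
                     alpha=0.3, decay_steps=100.0):
    """Outcome reward plus a trust-weighted, annealed thinking reward.

    `alpha` and `decay_steps` are illustrative values, not the paper's.
    A response counts as correct when its outcome reward is positive.
    """
    correct = [r > 0 for r in outcome_rewards]
    scale = (alpha
             * trust_weight(think_rewards, correct)
             * annealing_factor(step, decay_steps))
    return [o + scale * t for o, t in zip(outcome_rewards, think_rewards)]
```

Under these assumptions, a group whose incorrect responses score higher on thinking reward than its correct ones contributes less thinking reward to training, and late in training the combined reward is dominated by the rule-based outcome reward.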
