Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning

Abstract

Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspiration from human learning progression, from simple to complex and from easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer, more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at https://github.com/zrguo/Observe-R1.
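
To make the bonus reward and the dynamic weighting more concrete, the following is a minimal illustrative sketch, not the formulation used in Observe-R1. The abstract does not give the formulas, so everything here is an assumption: the function names `difficulty_weight` and `concise_bonus`, the weight `4p(1-p)` (which peaks when half of the sampled responses to a problem are correct), the linear length bonus, and the default `bonus=0.5`.

```python
def difficulty_weight(rollout_correct):
    """Hypothetical per-problem weight from the correctness of sampled responses.

    Assumption: uncertainty is proxied by the empirical accuracy p over the
    rollouts, and the weight 4*p*(1-p) is largest at p = 0.5 (medium
    difficulty) and zero when all rollouts agree.
    """
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    p = sum(rollout_correct) / len(rollout_correct)
    return 4.0 * p * (1.0 - p)


def concise_bonus(is_correct, length, max_length, bonus=0.5):
    """Hypothetical bonus for a correct answer that stays within a length limit.

    Assumption: the bonus shrinks linearly as the response approaches
    max_length and is zero for incorrect or over-length responses.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not is_correct or length > max_length:
        return 0.0
    return bonus * (1.0 - length / max_length)
```

Under these assumptions, a problem that the model solves in half of its rollouts receives the full weight, while one it always solves or always fails contributes nothing; a correct response using half of the length budget earns half of the maximum bonus.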
