OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning

Abstract

Multimodal Large Language Models (MLLMs) have gained significant traction for their ability to process diverse input data types and generate coherent, contextually relevant outputs across various applications. While supervised fine-tuning (SFT) has been the predominant approach to enhance MLLM capabilities in task-specific optimization, it often falls short in fostering crucial generalized reasoning abilities. Although reinforcement learning (RL) holds great promise in overcoming these limitations, it encounters two significant challenges: (1) its generalized capacities in multimodal tasks remain largely unexplored, and (2) its training constraints, including the constant Kullback-Leibler divergence or the clamp strategy, often result in suboptimal bottlenecks. To address these challenges, we propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks. Specifically, we introduce Group Relative Policy Optimization with a dynamic Kullback-Leibler strategy (GRPO-D), which markedly enhances reinforcement learning (RL) performance. For Qwen2-VL-2B-Instruct, GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation on two adapted datasets. Furthermore, GRPO-D demonstrates remarkable cross-task generalization capabilities, with an average relative improvement of more than 61.63% over SFT in cross-task evaluation. These results highlight that the MLLM trained with GRPO-D on one multimodal task can be effectively transferred to another task, underscoring the superior generalized reasoning capabilities of our proposed OThink-MR1 model.
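
To make the idea of a dynamic Kullback-Leibler strategy more concrete, the following is a minimal sketch, not the paper's implementation. It assumes the standard GRPO practice of normalizing rewards within a group of sampled responses, and it replaces the constant KL weight with a step-dependent one. The linear schedule, the function names (`group_relative_advantages`, `dynamic_kl_weight`, `grpo_d_objective`), and the default values `beta_start=0.04` and `beta_end=0.0` are all illustrative assumptions; the abstract does not specify the actual schedule, and the clamp on the probability ratio is omitted for brevity.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_kl_weight(step, total_steps, beta_start=0.04, beta_end=0.0):
    """Hypothetical schedule: move the KL weight linearly from
    beta_start to beta_end over training (an assumption, not the
    schedule used in the paper)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + (beta_end - beta_start) * frac


def grpo_d_objective(rewards, ratios, kl_terms, step, total_steps):
    """Simplified per-group objective (no ratio clamping):
    mean over samples of ratio_i * advantage_i - beta(step) * kl_i."""
    advantages = group_relative_advantages(rewards)
    beta = dynamic_kl_weight(step, total_steps)
    terms = [
        ratio * adv - beta * kl
        for ratio, adv, kl in zip(ratios, advantages, kl_terms)
    ]
    return mean(terms)


if __name__ == "__main__":
    # Four sampled responses for one prompt, with toy rewards.
    rewards = [1.0, 0.0, 1.0, 0.0]
    ratios = [1.1, 0.9, 1.0, 1.0]      # pi_new / pi_old per response
    kl_terms = [0.02, 0.01, 0.03, 0.01]  # KL estimate per response
    for step in (0, 50, 100):
        value = grpo_d_objective(rewards, ratios, kl_terms, step, 100)
        print(step, dynamic_kl_weight(step, 100), value)
```

With a constant weight, the KL penalty constrains the policy equally throughout training; a step-dependent weight lets the penalty vary over the course of training, which is the kind of flexibility the dynamic strategy is meant to provide.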
