Abstract
Multimodal Large Language Models (MLLMs) have gained significant traction for their ability to process diverse input data types and generate coherent, contextually relevant outputs across various applications. While supervised fine-tuning (SFT) has been the predominant approach to enhancing MLLM capabilities in task-specific optimization, it often falls short in fostering crucial generalized reasoning abilities. Although reinforcement learning (RL) holds great promise for overcoming these limitations, it faces two significant challenges: (1) its generalization capabilities in multimodal tasks remain largely unexplored, and (2) its training constraints, including the constant Kullback-Leibler (KL) divergence or the clamp strategy, often result in suboptimal bottlenecks. To address these challenges, we propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks. Specifically, we introduce Group Relative Policy Optimization with a dynamic KL strategy (GRPO-D), which markedly enhances RL performance. For Qwen2-VL-2B-Instruct, GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation on two adapted datasets. Furthermore, GRPO-D demonstrates remarkable cross-task generalization capabilities, with an average relative improvement of more than 61.63% over SFT in cross-task evaluation. These results highlight that an MLLM trained with GRPO-D on one multimodal task can be effectively transferred to another task, underscoring the superior generalized reasoning capabilities of our proposed OThink-MR1 model.