Abstract
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.