Abstract
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address them, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
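To make the two ideas above concrete, the following is a minimal illustrative sketch, not the Shuffle-R1 implementation. It assumes group-normalized advantages (reward minus group mean, divided by group standard deviation), which the abstract does not specify, and the function names `group_advantages` and `select_high_contrast_pairs` are hypothetical. It shows how a group with identical rewards yields all-zero advantages, and one plausible way to pair the highest-advantage rollouts with the lowest-advantage ones.

```python
import numpy as np


def group_advantages(rewards):
    """Group-normalized advantages (an assumed formulation, not from the abstract)."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:
        # Every rollout received the same reward: no contrast, so no gradient signal.
        return np.zeros_like(r)
    return (r - r.mean()) / std


def select_high_contrast_pairs(advantages, num_pairs):
    """Pair the largest advantage with the smallest, the second largest with
    the second smallest, and so on; keep the first `num_pairs` pairs.

    This is a hypothetical selection rule used only for illustration.
    """
    order = np.argsort(advantages, kind="stable")  # indices in ascending order
    n = len(order)
    pairs = [(int(order[n - 1 - i]), int(order[i])) for i in range(n // 2)]
    return pairs[:num_pairs]


# A group of 8 rollouts with sparse binary rewards.
adv = group_advantages([1, 0, 0, 0, 1, 0, 0, 0])
pairs = select_high_contrast_pairs(adv, num_pairs=2)

# A group where every rollout succeeds contributes nothing to the update.
silent = group_advantages([1, 1, 1, 1])
```

Each returned pair is `(high_index, low_index)`, so a training batch built from these pairs keeps only the rollouts with the largest advantage magnitudes in this toy setting.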