PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

Abstract

Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, together with a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, for exploration, we introduce permutation of image sequences to simulate varied positional relationships and cover more spatial and positional diversity. For exploitation, we design a rollout filtering mechanism for resampling that focuses on the trajectories contributing most to learning optimal behaviors. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.
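
As a rough illustration only, the two ideas named above (permuting the image sequence and filtering rollouts) might be sketched in Python as follows. The abstract gives no implementation details, so everything here is an assumption: the function names `permute_images` and `keep_informative_groups` are hypothetical, the sketch assumes the answer refers to an image by position, and the filtering rule (keep only prompts whose sampled rollouts received differing rewards) is one common heuristic, not necessarily the criterion PeRL uses.

```python
import random
from typing import List, Sequence, Tuple


def permute_images(
    images: Sequence[str], answer_index: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Shuffle the image order and remap a position-based answer.

    Assumption: the ground-truth answer is the index of one image, so it
    must be updated to stay correct after the shuffle.
    """
    order = list(range(len(images)))
    rng.shuffle(order)  # order[k] = original index now placed at position k
    permuted = [images[i] for i in order]
    new_answer_index = order.index(answer_index)
    return permuted, new_answer_index


def keep_informative_groups(reward_groups: List[List[float]]) -> List[int]:
    """Return indices of prompts whose rollouts did not all get the same reward.

    Assumption: a group with identical rewards carries no relative learning
    signal, so it is dropped and could be resampled instead.
    """
    return [i for i, rewards in enumerate(reward_groups) if len(set(rewards)) > 1]
```

In this sketch, a group of rollouts that are all correct or all incorrect is treated as uninformative, while a mixed group is kept for the policy update.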
