One RL to See Them All: Visual Triple Unified Reinforcement Learning

Abstract

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to +14.1 across its various 7B and 32B model variants, and the performance benefits extend to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
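
The abstract does not specify how the Dynamic IoU reward is computed, so the following is only a minimal sketch of one plausible reading: an IoU-based reward whose acceptance threshold tightens as training progresses. The function names, the `SCHEDULE` values, and the choice to return the raw IoU once the threshold is cleared are all illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
# These numbers are placeholders chosen for illustration only.
SCHEDULE: Tuple[Tuple[float, float], ...] = (
    (0.10, 0.50),
    (0.25, 0.75),
    (1.00, 0.90),
)


def dynamic_threshold(progress: float) -> float:
    """Return the threshold of the first stage that contains `progress`."""
    for stage_end, threshold in SCHEDULE:
        if progress <= stage_end:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Assumed rule: reward the IoU if it clears the current threshold, else 0."""
    score = iou(pred, target)
    return score if score >= dynamic_threshold(progress) else 0.0
```

Under these assumed values, a prediction with IoU 0.8 earns a reward early in training, when the threshold is lenient, and earns nothing late in training, when the threshold is strict. That is one way feedback could be both progressive and unambiguous.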
