Abstract
Reinforcement learning (RL) has demonstrated immense potential in advancing artificial general intelligence, agentic intelligence, and embodied intelligence. However, the inherent heterogeneity and dynamicity of RL workflows often lead to low hardware utilization and slow training on existing systems. In this paper, we present RLinf, a high-performance RL training system based on our key observation that the major roadblock to efficient RL training lies in system flexibility. To maximize flexibility and efficiency, RLinf is built atop a novel RL system design paradigm called macro-to-micro flow transformation (M2Flow), which automatically breaks down high-level, easy-to-compose RL workflows along both the temporal and spatial dimensions and recomposes them into optimized execution flows. Building on the adaptive communication capability of RLinf workers, we devise context switching and elastic pipelining to realize the M2Flow transformation, together with a profiling-guided scheduling policy that generates optimal execution plans. Extensive evaluations on both reasoning RL and embodied RL tasks demonstrate that RLinf consistently outperforms state-of-the-art systems, achieving a 1.1×–2.13× speedup in end-to-end training throughput.
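To make the idea of a profiling-guided choice between temporal and spatial execution more concrete, the following is a toy cost model. It is a minimal sketch under stated assumptions, not RLinf's scheduler or API. All names (`StageProfile`, `collocated_time`, `pipelined_time`, `best_plan`) are hypothetical. The model assumes a two-stage workflow (rollout, then training), with profiled per-batch times measured on all GPUs. A stage running on a fraction `f` of the GPUs is assumed to be exactly `1/f` times slower (linear scaling). The "collocated" plan time-shares all GPUs between the stages and pays a context-switch cost. The "pipelined" plan splits the GPUs and overlaps the stages over equal-size micro-batches.

```python
from dataclasses import dataclass


@dataclass
class StageProfile:
    # Hypothetical profiled costs, in seconds, for one full batch on ALL GPUs.
    rollout_s: float
    train_s: float
    switch_s: float  # hypothetical cost of swapping models in/out of GPU memory


def collocated_time(p: StageProfile) -> float:
    # Temporal multiplexing: both stages use every GPU, one after the other,
    # with a context switch in between.
    return p.rollout_s + p.switch_s + p.train_s


def pipelined_time(p: StageProfile, rollout_frac: float, micro_batches: int) -> float:
    # Spatial split: rollout gets rollout_frac of the GPUs, training the rest.
    # Assumption: linear scaling, so running on fraction f takes 1/f as long.
    r = p.rollout_s / rollout_frac / micro_batches
    t = p.train_s / (1.0 - rollout_frac) / micro_batches
    # Makespan of a two-stage pipeline over equal-size micro-batches:
    # fill (r), drain (t), and (m - 1) steps limited by the slower stage.
    return r + t + (micro_batches - 1) * max(r, t)


def best_plan(p: StageProfile,
              fracs=(0.25, 0.5, 0.75),
              micro_batch_options=(1, 2, 4, 8)):
    # Enumerate candidate plans and keep the one with the lowest estimated time.
    plans = [("collocated", None, None, collocated_time(p))]
    for f in fracs:
        for m in micro_batch_options:
            plans.append(("pipelined", f, m, pipelined_time(p, f, m)))
    return min(plans, key=lambda plan: plan[3])


if __name__ == "__main__":
    print(best_plan(StageProfile(rollout_s=60.0, train_s=20.0, switch_s=0.0)))
    print(best_plan(StageProfile(rollout_s=60.0, train_s=20.0, switch_s=30.0)))
```

With a free context switch, the first call picks the collocated plan (80 s). With a 30 s switch cost, the second call picks a balanced pipeline: 75% of the GPUs for rollout and 8 micro-batches, for an estimated 90 s versus 110 s collocated. Under the linear-scaling assumption, pipelining only pays off when switching is expensive. A real system would also profile sublinear scaling, for example for memory-bound decoding, and that is precisely why measured profiles rather than fixed rules drive the choice.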