Abstract
Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy updates and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle while waiting for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE), which estimates advantages from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computations and accelerates the training process without sacrificing convergence performance. We demonstrate the effectiveness and efficacy of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
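To make the idea of estimating advantages from incomplete responses concrete, the following is a minimal sketch of standard Generalized Advantage Estimation with a bootstrap value at the end of the available tokens. It is an illustration under stated assumptions, not the EGAE formulation itself, which the abstract does not specify. The function name `gae_with_bootstrap`, its parameters, and the choice to bootstrap from the value model's estimate at the truncation point are all hypothetical.

```python
from typing import List


def gae_with_bootstrap(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over a (possibly truncated) token sequence.

    rewards[t] and values[t] cover the generated tokens t = 0..T-1.
    bootstrap_value stands in for V(s_T): 0.0 when the response has
    finished, or (as an assumption of this sketch) the value model's
    estimate at the truncation point when the response is incomplete.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    # Walk backwards, accumulating discounted TD residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Complete response: terminal reward observed, so bootstrap with 0.0.
full = gae_with_bootstrap([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], 0.0, lam=1.0)

# Truncated response: no terminal reward yet, so bootstrap from a
# hypothetical value estimate at the cut-off.
partial = gae_with_bootstrap([0.0, 0.0], [0.5, 0.5], 0.8, lam=1.0)
```

In this sketch the truncated case differs from the complete case only in the bootstrap term, which is what lets a policy update proceed before the full rollout is available.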