Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

Abstract

Recently, Large Language Models (LLMs) have evolved rapidly, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have achieved strong performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) have not yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, so their use in this area remains worth exploring. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code for this paper is released and is being updated at https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.
