Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

Abstract

Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty. Differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE), developed in supervised learning for noisy data, to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments on discrete action space tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise; SPPO performs especially well across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss when using SPPO for large language models through improved performance on RLHF tasks, such as IMDB positive sentiment and TL;DR summarization.
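
As background for the reverse cross entropy mentioned above, the following is a minimal sketch of the supervised form of RCE as used for noisy-label classification, not the symmetric RL loss itself, which the abstract does not spell out. The function names, the weights `alpha` and `beta`, and the clamp value `log_zero = -4.0` (standing in for the undefined log 0 of a one-hot label) are illustrative assumptions.

```python
import math

def cross_entropy(probs, label):
    """Standard cross entropy for one sample: -log p(label)."""
    return -math.log(probs[label])

def reverse_cross_entropy(probs, label, log_zero=-4.0):
    """Reverse cross entropy: -sum_k p(k) * log q(k), with q the one-hot label.

    log q(k) is 0 for the labelled class; for every other class log(0) is
    undefined, so it is replaced by the constant log_zero (an assumed value).
    """
    return -sum(p * (0.0 if k == label else log_zero)
                for k, p in enumerate(probs))

def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Weighted sum of the standard and reverse terms (weights are assumptions)."""
    return (alpha * cross_entropy(probs, label)
            + beta * reverse_cross_entropy(probs, label, log_zero))

if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]  # hypothetical predicted class probabilities
    print(cross_entropy(probs, 0))
    print(reverse_cross_entropy(probs, 0))
    print(symmetric_cross_entropy(probs, 0))
```

With a one-hot label, the reverse term reduces to `-log_zero * (1 - p(label))`, so it is bounded and grows only linearly as the predicted probability of the label falls, whereas the standard term grows without bound. That boundedness is what makes the reverse term less sensitive to mislabelled samples in the supervised setting.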
