Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Abstract

Reinforcement learning (RL) can align language models with non-differentiable reward signals, such as human preferences. However, a major challenge arises from the sparsity of these reward signals: typically, there is only a single reward for an entire output. This sparsity of rewards can lead to inefficient and unstable learning. To address this challenge, our paper introduces a novel framework that utilizes the critique capability of Large Language Models (LLMs) to produce intermediate-step rewards during RL training. Our method involves coupling a policy model with a critic language model, which is responsible for providing comprehensive feedback on each part of the output. This feedback is then translated into token- or span-level rewards that can be used to guide the RL training process. We investigate this approach under two different settings: one where the policy model is smaller and is paired with a more powerful critic model, and another where a single language model fulfills both roles. We assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. Experimental results show that incorporating artificial intrinsic rewards significantly improves both sample efficiency and the overall performance of the policy model, a finding supported by both automatic and human evaluation.
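As a rough illustration of how span-level critic feedback might be turned into dense rewards, the sketch below spreads critic scores over token positions and adds the usual single end-of-sequence reward to the last token. It is not the paper's implementation: the function name `combine_rewards`, the `(start, end, score)` span format, and the `intrinsic_weight` parameter are all assumptions made for this example.

```python
from typing import List, Tuple


def combine_rewards(
    num_tokens: int,
    critic_spans: List[Tuple[int, int, float]],
    final_reward: float,
    intrinsic_weight: float = 0.5,
) -> List[float]:
    """Build one reward per token from critic spans plus a sparse final reward.

    critic_spans holds hypothetical (start, end, score) triples, where
    [start, end) is a token range the critic judged and score is its verdict.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")

    rewards = [0.0] * num_tokens

    # Intrinsic part: each token in a judged span receives the weighted score.
    for start, end, score in critic_spans:
        for i in range(max(start, 0), min(end, num_tokens)):
            rewards[i] += intrinsic_weight * score

    # Extrinsic part: the single sequence-level reward lands on the last token.
    rewards[-1] += final_reward
    return rewards


# Five generated tokens; the critic praises tokens 0-1 and penalizes tokens 3-4.
dense = combine_rewards(
    5, [(0, 2, 1.0), (3, 5, -1.0)], final_reward=2.0, intrinsic_weight=0.5
)
print(dense)  # [0.5, 0.5, 0.0, -0.5, 1.5]
```

With no critic spans the function reduces to the sparse setting the abstract describes, where only the final token carries a nonzero reward.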
