Logarithmic Regret for Online KL-Regularized Reinforcement Learning

Abstract

Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of the KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta$, $N_{\mathcal R}$, $T$, and $d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, the number of rounds, and the complexity of the reward function class, respectively. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps, and obtain a similar logarithmic regret bound.
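As an illustration of the "benign optimization landscape" mentioned above, the following minimal sketch computes the maximizer of a KL-regularized objective for a single context with a finite action set. It assumes the common convention in which the objective is the expected reward minus $\eta^{-1}$ times the KL divergence to a reference policy; the abstract does not state the objective, so this convention is an assumption, chosen because it matches a regret bound that grows with $\eta$. The function names `kl_regularized_policy` and `kl_objective`, and the example rewards and reference probabilities, are hypothetical. The sketch covers only the regularized objective for a known reward, not the optimism-based algorithm or its regret analysis.

```python
import math


def kl_regularized_policy(rewards, ref_probs, eta):
    """Maximizer of sum_a p(a) r(a) - (1/eta) * KL(p || ref) over distributions p.

    Assumed convention: larger eta means weaker regularization.
    """
    # Gibbs form: p(a) is proportional to ref(a) * exp(eta * r(a)).
    # Subtracting the max reward keeps the exponentials numerically stable.
    r_max = max(rewards)
    weights = [q * math.exp(eta * (r - r_max)) for r, q in zip(rewards, ref_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_objective(probs, rewards, ref_probs, eta):
    """Expected reward minus (1/eta) * KL(probs || ref_probs)."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Terms with p == 0 contribute zero to the KL divergence.
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0)
    return expected_reward - kl / eta


# Hypothetical single-context example with three actions.
rewards = [1.0, 0.5, 0.0]
ref_probs = [0.2, 0.3, 0.5]
eta = 2.0

policy = kl_regularized_policy(rewards, ref_probs, eta)
best_value = kl_objective(policy, rewards, ref_probs, eta)
```

Under this convention the objective is strongly concave in the policy, so the maximizer is unique and has the closed form used above. Both the reference policy and the greedy policy on the best action score lower than `policy` under `kl_objective`.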
