Average-Reward Reinforcement Learning with Trust Region Methods

Abstract

Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. First, we develop a unified trust region theory for both the discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach with the average criterion, and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.
