Constrained Policy Improvement for Safe and Efficient Reinforcement Learning

Abstract

We propose a policy improvement algorithm for Reinforcement Learning (RL) called Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the $Q$-function. Such errors are common in RL when learning the $Q$-value from finite past experience data. Greedy policies, or even constrained policy optimization algorithms that ignore these errors, may suffer from an improvement penalty (i.e., a negative policy improvement). To minimize the improvement penalty, RBI attenuates rapid policy changes for low-probability actions, which were sampled less frequently. This approach is shown to avoid catastrophic performance degradation and reduce regret when learning from a batch of past experience. Using a two-armed bandit example with Gaussian-distributed rewards, we show that it also increases data efficiency when the optimal action has a high variance. We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms, both as a safe learning approach and as a general data-efficient learning algorithm. An anonymous GitHub repository of our RBI implementation is available at https://github.com/eladsar/rbi.
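
The idea of attenuating changes to rarely sampled actions can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes one simple form of constraint: the new policy pi must stay within a multiplicative band around the behavior policy beta, with hypothetical bounds c_min and c_max, so an action that beta rarely takes can gain only a small amount of probability even if its estimated value is high. The function names and default values are illustrative assumptions.

```python
import numpy as np


def greedy_improvement(q):
    """Put all probability on the action with the highest estimated value."""
    q = np.asarray(q, dtype=float)
    pi = np.zeros_like(q)
    pi[np.argmax(q)] = 1.0
    return pi


def ratio_constrained_improvement(beta, q, c_min=0.5, c_max=1.5):
    """Maximize sum_a pi(a) * q(a) subject to
    c_min * beta(a) <= pi(a) <= c_max * beta(a) and sum_a pi(a) = 1.

    The bounds are assumed to satisfy c_min <= 1 <= c_max, so that a
    feasible policy exists.
    """
    beta = np.asarray(beta, dtype=float)
    q = np.asarray(q, dtype=float)
    pi = c_min * beta              # every action starts at its lower bound
    budget = 1.0 - pi.sum()        # probability mass still to hand out
    for a in np.argsort(-q):       # highest estimated value first
        add = min(budget, (c_max - c_min) * beta[a])
        pi[a] += add
        budget -= add
    return pi


# The behavior policy rarely takes action 1, so its value estimate is noisy.
beta = [0.9, 0.1]
q_estimate = [0.0, 1.0]
print(greedy_improvement(q_estimate))
print(ratio_constrained_improvement(beta, q_estimate))
```

With these inputs the greedy step commits fully to the rarely sampled action, while the constrained step raises its probability only from 0.1 to roughly 0.15. If the estimate for that action turns out to be wrong, the constrained policy loses far less.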
