On-Policy Trust Region Policy Optimisation with Replay Buffers

  • 2019-01-18 13:09:18
  • Dmitry Kangin, Nicolas Pugeault

Abstract

Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of improving on-policy reinforcement learning by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create a method that combines the advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the $Q$-, value and advantage functions for data from multiple policies. The method uses trust region optimisation while avoiding some of the common problems of algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, and a trainable covariance matrix instead of a fixed one. In many cases, the method improves on the results not only of state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also of their off-policy counterpart DDPG.
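To make the idea of reusing data from several consecutive policies concrete, the following is a minimal sketch of a replay buffer that keeps the rollouts of only the most recent few policies. It is an illustration under stated assumptions, not the authors' implementation: the class name `PolicyReplayBuffer`, the `max_policies` parameter, and the `(state, action, reward)` transition layout are all hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical transition layout: (state, action, reward).
Transition = Tuple[List[float], List[float], float]


class PolicyReplayBuffer:
    """Keeps the rollouts of the last `max_policies` policies (illustrative only)."""

    def __init__(self, max_policies: int) -> None:
        # A bounded deque silently drops the oldest policy's data when full.
        self._batches: Deque[List[Transition]] = deque(maxlen=max_policies)

    def add_policy_data(self, transitions: List[Transition]) -> None:
        """Store all transitions collected under one policy as a single batch."""
        self._batches.append(list(transitions))

    def all_transitions(self) -> List[Transition]:
        """Return the transitions of every retained policy, oldest policy first."""
        return [t for batch in self._batches for t in batch]

    def __len__(self) -> int:
        # Number of policies whose data is currently retained.
        return len(self._batches)


buffer = PolicyReplayBuffer(max_policies=2)
buffer.add_policy_data([([0.0], [1.0], 1.0)])
buffer.add_policy_data([([0.1], [1.0], 2.0), ([0.2], [0.5], 3.0)])
buffer.add_policy_data([([0.3], [0.0], 4.0)])  # evicts the first policy's data
```

Bounding the buffer by the number of policies, rather than by the number of transitions, keeps the retained data close to the current policy; whether the paper uses this exact eviction rule is not stated in the abstract.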

 
