Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy

Abstract

Approximating optimal policies in reinforcement learning (RL), a task termed policy optimization, is often necessary in real-world scenarios. When reinforcement learning is viewed from the perspective of variational inference (VI), the policy network is trained to approximate the posterior of actions given the optimality criteria. In practice, however, policy optimization may yield suboptimal policy estimates because of the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate policy optimization with HMC. Specifically, we evolve actions sampled from the base policy according to HMC, which has two benefits: i) HMC can improve the policy distribution so that it better approximates the posterior, and hence reduces the amortization gap; ii) HMC can also guide exploration toward regions with higher action values, enhancing exploration efficiency. Instead of directly applying HMC to RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. With comprehensive empirical experiments on continuous control benchmarks, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods. In addition, the proposed approach outperforms previous methods on the DeepMind Control Suite, which has an image-based, high-dimensional observation space.
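
To make the idea of "evolving actions according to HMC" concrete, the following is a minimal sketch of the standard leapfrog integrator applied to an action. It is not the new leapfrog operator proposed in the paper, whose details the abstract does not give. The quadratic critic, the `target` action, and all function names (`q_value`, `grad_q`, `leapfrog`, `hamiltonian`) are hypothetical and chosen only for illustration. The potential energy is assumed to be U(a) = -Q(a), with any temperature scaling omitted.

```python
import random


def q_value(action, target):
    """Toy critic (an assumption): Q(a) = -0.5 * ||a - target||^2."""
    return -0.5 * sum((a - t) ** 2 for a, t in zip(action, target))


def grad_q(action, target):
    """Gradient of the toy critic with respect to the action."""
    return [-(a - t) for a, t in zip(action, target)]


def leapfrog(action, momentum, target, step_size=0.1, n_steps=5):
    """Standard leapfrog steps for the potential U(a) = -Q(a).

    Since -grad U = grad Q, each momentum half-step moves along grad Q.
    """
    a, p = list(action), list(momentum)
    for _ in range(n_steps):
        g = grad_q(a, target)
        p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, g)]  # half step
        a = [ai + step_size * pi for ai, pi in zip(a, p)]        # full step
        g = grad_q(a, target)
        p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, g)]  # half step
    return a, p


def hamiltonian(action, momentum, target):
    """Total energy H = U(a) + 0.5 * ||p||^2 with U(a) = -Q(a)."""
    return -q_value(action, target) + 0.5 * sum(pi * pi for pi in momentum)


rng = random.Random(0)
target = [1.0, -0.5]                          # hypothetical high-value action
a0 = [rng.gauss(0.0, 1.0) for _ in target]    # sample from a Gaussian base policy
p0 = [rng.gauss(0.0, 1.0) for _ in target]    # auxiliary momentum
a1, p1 = leapfrog(a0, p0, target)
print("Q before:", q_value(a0, target), "Q after:", q_value(a1, target))
```

Because leapfrog approximately conserves the Hamiltonian, an action that starts with little momentum trades potential energy for kinetic energy and drifts toward regions where Q is higher, which is one way to read the exploration benefit described above.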
