Relative Importance Sampling For Off-Policy Actor-Critic in Deep Reinforcement Learning

  • 2019-01-15 07:13:50
  • Mahammad Humayoo, Xueqi Cheng

Abstract

Off-policy learning is less stable than on-policy learning in reinforcement learning (RL). One reason for this instability is a discrepancy between the target ($\pi$) and behavior ($b$) policy distributions. This discrepancy can be alleviated by employing a smooth variant of importance sampling (IS), such as relative importance sampling (RIS). RIS has a parameter $\beta \in [0, 1]$ that controls smoothness. To cope with instability, we present the first relative importance sampling off-policy actor-critic (RIS-Off-PAC) model-free algorithms in RL. In our method, the network yields a target policy (the actor), a value function (the critic) assessing the current policy ($\pi$), and a behavior policy. We train our algorithm using action values generated from the behavior policy rather than from the target policy. We also use deep neural networks to train both actor and critic. We evaluate our algorithm on a number of OpenAI Gym benchmark problems and demonstrate better or comparable performance to several state-of-the-art RL baselines.
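
The abstract does not state the RIS weight itself. As a minimal sketch, assuming the standard relative density-ratio form $w_\beta = \pi / (\beta\pi + (1-\beta)b)$, the effect of $\beta$ can be illustrated as follows. The function name `relative_importance_weight` and the sample probabilities are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def relative_importance_weight(pi_prob, b_prob, beta):
    """Relative importance weight pi / (beta * pi + (1 - beta) * b).

    ASSUMPTION: this is the standard relative density-ratio form; the
    abstract itself does not give the formula. Inputs are the
    probabilities of the taken actions under the target policy (pi_prob)
    and the behavior policy (b_prob), assumed strictly positive.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    pi_prob = np.asarray(pi_prob, dtype=float)
    b_prob = np.asarray(b_prob, dtype=float)
    # Mix the two distributions in the denominator to smooth the ratio.
    return pi_prob / (beta * pi_prob + (1.0 - beta) * b_prob)

# Hypothetical probabilities: the target policy strongly prefers an
# action that the behavior policy rarely takes.
pi_prob = np.array([0.9])
b_prob = np.array([0.01])

ordinary_is = relative_importance_weight(pi_prob, b_prob, beta=0.0)
smoothed = relative_importance_weight(pi_prob, b_prob, beta=0.5)
```

Under this assumed form, $\beta = 0$ recovers the ordinary importance weight $\pi / b$, which is unbounded when $b$ is small, while any $\beta > 0$ caps the weight at $1/\beta$. That cap is one way to read the claim that $\beta$ controls smoothness.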

 
