Abstract
On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms.
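The notion of sampling error can be made concrete with a toy, single-state example. The sketch below is not PROPS: it assumes a fixed three-action policy, measures sampling error as the total variation distance between empirical action frequencies and the policy (an assumed metric chosen for illustration), and replaces the learned behavior policy with a simple greedy rule that always takes the most under-sampled action. The function names `sampling_error`, `on_policy_counts`, and `adaptive_counts` are hypothetical.

```python
import numpy as np


def sampling_error(counts, pi):
    """Total variation distance between empirical action frequencies and pi."""
    empirical = counts / counts.sum()
    return 0.5 * np.abs(empirical - pi).sum()


def on_policy_counts(pi, n, rng):
    """Draw n i.i.d. actions from pi and return per-action counts."""
    actions = rng.choice(len(pi), size=n, p=pi)
    return np.bincount(actions, minlength=len(pi))


def adaptive_counts(pi, n):
    """Greedy stand-in for adaptive sampling (an assumption, not PROPS):
    at each step, take the action whose observed count lags its expected
    count under pi by the largest margin."""
    counts = np.zeros(len(pi))
    for t in range(n):
        # Expected count after t + 1 samples minus the count observed so far.
        deficit = (t + 1) * pi - counts
        counts[np.argmax(deficit)] += 1
    return counts


pi = np.array([0.6, 0.3, 0.1])   # illustrative target policy
n = 50                           # finite sample budget
rng = np.random.default_rng(0)

# Average on-policy sampling error over repeated batches of n samples.
err_on_policy = np.mean(
    [sampling_error(on_policy_counts(pi, n, rng), pi) for _ in range(200)]
)
err_adaptive = sampling_error(adaptive_counts(pi, n), pi)

print(f"on-policy sampling error (mean): {err_on_policy:.3f}")
print(f"adaptive sampling error:         {err_adaptive:.3f}")
```

In this toy setting the non-i.i.d. sampler tracks the policy's action proportions more closely than i.i.d. sampling does at the same budget, which is the effect the abstract attributes to adaptive off-policy sampling.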