Off-Policy Actor-Critic with Emphatic Weightings

Abstract

A variety of theoretically sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods, particularly OffPAC and DPG, converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
