Deep Primal-Dual Reinforcement Learning: Accelerating Actor-Critic using Bellman Duality

Abstract

We develop a parameterized Primal-Dual $\pi$ Learning method based on deep neural networks for Markov decision processes with large state spaces and for off-policy reinforcement learning. In contrast to the popular Q-learning and actor-critic methods, which are based on successive approximations to the nonlinear Bellman equation, our method makes primal-dual updates to the policy and value functions using the fundamental linear Bellman duality. A naive parametrization of the primal-dual $\pi$ learning method using deep neural networks would encounter two major challenges: (1) each update requires computing a probability distribution over the state space, which is intractable; (2) the iterates are unstable because the parameterized Lagrangian function is no longer linear. We address these challenges by proposing a relaxed Lagrangian formulation with a regularization penalty based on the advantage function. We show that the dual policy update step in our method is equivalent to the policy gradient update in the actor-critic method in a special case, while the value updates differ substantially. The main advantage of the primal-dual $\pi$ learning method is that the value and policy updates are closely coupled through the Bellman duality and are therefore more informative. Experiments on a simple cart-pole problem show that the algorithm significantly outperforms the one-step temporal-difference actor-critic method, which is the most relevant benchmark for comparison. We believe that the primal-dual updates to the value and policy functions would expedite the learning process. The proposed methods might open a door to more efficient algorithms and sharper theoretical analysis.
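
For orientation, the linear Bellman duality mentioned above is commonly written as a pair of linear programs. The sketch below assumes the infinite-horizon discounted setting, which the abstract does not specify, and uses standard textbook symbols that are assumptions rather than the paper's own notation: discount factor $\gamma$, initial state distribution $\xi$, reward $r(s,a)$, transition kernel $P(s' \mid s,a)$, value function $v$, and state-action occupancy measure $\mu$.

$$
\begin{aligned}
\text{(primal)}\quad & \min_{v}\ (1-\gamma)\sum_{s}\xi(s)\,v(s)
&& \text{s.t.}\quad v(s) \ge r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\,v(s') \quad \forall\, s,a, \\
\text{(dual)}\quad & \max_{\mu \ge 0}\ \sum_{s,a}\mu(s,a)\,r(s,a)
&& \text{s.t.}\quad \sum_{a}\mu(s',a) = (1-\gamma)\,\xi(s') + \gamma\sum_{s,a}P(s' \mid s,a)\,\mu(s,a) \quad \forall\, s'.
\end{aligned}
$$

Both programs are the two sides of the saddle-point problem $\min_{v}\max_{\mu \ge 0} L(v,\mu)$ with

$$
L(v,\mu) = (1-\gamma)\sum_{s}\xi(s)\,v(s) + \sum_{s,a}\mu(s,a)\Big[r(s,a) + \gamma\sum_{s'}P(s' \mid s,a)\,v(s') - v(s)\Big].
$$

In this tabular form $L$ is linear in $v$ and in $\mu$ separately, a policy can be read off the dual variable as $\pi(a \mid s) = \mu(s,a) / \sum_{a'}\mu(s,a')$, and the bracketed term has the form of an advantage. Once $v$ and $\mu$ are replaced by neural networks, $L$ is no longer linear in the parameters, which is the source of the instability described in challenge (2).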
