Policy Optimization as Wasserstein Gradient Flows

Abstract

Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize the parameters of a policy by maximizing the expected total reward or a surrogate of it. Though these methods often achieve encouraging empirical success, the underlying mathematical principle of policy-distribution optimization remains unclear. We place policy optimization into the space of probability measures and interpret it as Wasserstein gradient flows. On the probability-measure space, under specified circumstances, policy optimization becomes a convex problem in terms of distribution optimization. To make optimization feasible, we develop efficient algorithms by numerically solving the corresponding discrete gradient flows. Our technique is applicable to several RL settings and is related to many state-of-the-art policy-optimization algorithms. Empirical results verify the effectiveness of our framework, which often obtains better performance than related algorithms.
