Independent Policy Gradient Methods for Competitive Reinforcement Learning

Abstract

We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where, in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.
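
As a rough illustration of the two-timescale idea, here is a minimal sketch in Python. It is not the paper's algorithm or analysis. Everything specific in it is an assumption made for illustration: the matching-pennies matrix game used as a single-state stand-in for a stochastic game, the step sizes, the exploration floor, the choice of the min player as the slow learner, and the function names. Each player sees only its own sampled action and its own reward, and updates with a single-sample REINFORCE-style estimate followed by a projection.

```python
import random

# Assumption: a 2x2 zero-sum matrix game (matching pennies) stands in for a
# single-state stochastic game. R[a][b] is the reward paid to the max player.
R = [[1.0, -1.0], [-1.0, 1.0]]
FLOOR = 0.05  # assumed exploration floor; keeps importance weights bounded


def clip(p):
    """Keep P(action 0) inside [FLOOR, 1 - FLOOR]."""
    return min(max(p, FLOOR), 1.0 - FLOOR)


def pg_step(prob0, action, reward, lr):
    """One projected REINFORCE step on a 2-action policy (prob0 = P(action 0)).

    The single-sample gradient estimate for the action taken is
    reward / P(action) and zero for the other action. In two dimensions,
    projecting onto the simplex amounts to moving prob0 by half the
    difference of the two gradient entries and then clipping.
    """
    if action == 0:
        diff = reward / prob0
    else:
        diff = -reward / (1.0 - prob0)
    return clip(prob0 + lr * diff / 2.0)


def run_independent_pg(num_episodes=20000, lr_slow=0.001, lr_fast=0.05, seed=0):
    """Run both learners in tandem; return (averaged min policy, last max policy)."""
    rng = random.Random(seed)
    p, q = 0.9, 0.2  # P(action 0) for the min player and the max player
    p_sum = 0.0
    for _ in range(num_episodes):
        p_sum += p
        a = 0 if rng.random() < p else 1
        b = 0 if rng.random() < q else 1
        r = R[a][b]
        # Each player ascends its OWN reward using only its own action.
        # Assumed two-timescale rule: the min player uses the smaller step.
        p_next = pg_step(p, a, -r, lr_slow)  # min player's reward is -r
        q_next = pg_step(q, b, r, lr_fast)   # max player's reward is r
        p, q = p_next, q_next
    return p_sum / num_episodes, q


def exploitability_of_min_player(p):
    """Best-response payoff of the max player against p, minus the game value (0)."""
    payoff_b0 = p * R[0][0] + (1.0 - p) * R[1][0]
    payoff_b1 = p * R[0][1] + (1.0 - p) * R[1][1]
    return max(payoff_b0, payoff_b1)


if __name__ == "__main__":
    avg_p, last_q = run_independent_pg()
    print("averaged min policy P(action 0):", avg_p)
    print("exploitability of averaged min policy:", exploitability_of_min_player(avg_p))
```

Setting `lr_slow` equal to `lr_fast` and rerunning gives a quick way to see how sensitive this toy setup is to the step-size ratio. The sketch makes no claim about convergence rates.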
