A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning

Abstract

A fundamental challenge in multiagent reinforcement learning is to learnbeneficial behaviors in a shared environment with other simultaneously learningagents. In particular, each agent perceives the environment as effectivelynon-stationary due to the changing policies of other agents. Moreover, eachagent is itself constantly learning, leading to natural non-stationarity in thedistribution of experiences encountered. In this paper, we propose a novelmeta-multiagent policy gradient theorem that directly accounts for thenon-stationary policy dynamics inherent to multiagent learning settings. Thisis achieved by modeling our gradient updates to consider both an agent's ownnon-stationary policy dynamics and the non-stationary policy dynamics of otheragents in the environment. We show that our theoretically grounded approachprovides a general solution to the multiagent learning problem, whichinherently comprises all key aspects of previous state of the art approaches onthis topic. We test our method on a diverse suite of multiagent benchmarks anddemonstrate a more efficient ability to adapt to new agents as they learn thanbaseline methods across the full spectrum of mixed incentive, competitive, andcooperative domains.

Quick Read (beta)

loading the full paper ...