Counterfactual Multi-Agent Policy Gradients

Abstract

Cooperative multi-agent systems can be naturally used to model many real-world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
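
To make the counterfactual baseline concrete, here is a minimal sketch in Python. It assumes the centralised critic returns, for one agent, a Q-value for each of that agent's possible actions while the other agents' actions are held fixed. The baseline is then the policy-weighted average of those Q-values, and the advantage is the Q-value of the action actually taken minus that baseline. The function name, argument layout, and toy numbers are illustrative assumptions, not taken from the paper.

```python
def counterfactual_advantage(q_values, policy, action):
    """Advantage of one agent's chosen action against a counterfactual baseline.

    q_values[k]: assumed critic estimate of Q when this agent takes action k
                 and all other agents' actions stay fixed.
    policy[k]:   this agent's probability of choosing action k.
    action:      index of the action the agent actually took.
    """
    if len(q_values) != len(policy):
        raise ValueError("q_values and policy must have the same length")
    # Marginalise out this agent's action under its own policy.
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[action] - baseline


# Toy example with three actions (hypothetical numbers).
q = [1.0, 3.0, 2.0]
pi = [0.5, 0.25, 0.25]
adv = counterfactual_advantage(q, pi, action=1)
```

Because the baseline is the expectation of the same Q-values under the agent's own policy, the advantage averages to zero over that policy. It therefore measures only how much the chosen action helped or hurt relative to the agent's other options.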
