Off-Policy Correction For Multi-Agent Reinforcement Learning

Abstract

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite apparent similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded: we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all its tasks and exceeds state-of-the-art results on some of them.
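To make the importance-sampling correction concrete, the following is a minimal sketch of V-Trace-style value targets with truncated importance weights, written in plain Python. It is not the authors' implementation. The function names, the truncation levels `rho_bar` and `c_bar`, and the choice to form the joint importance ratio as a product of per-agent ratios (which assumes the joint policy factorizes across agents) are all illustrative assumptions.

```python
import math


def joint_log_ratio(target_logps, behavior_logps):
    """Log importance ratio of a joint action.

    Assumption: the joint policy factorizes across agents, so the joint
    ratio is the product of per-agent ratios (a sum in log space).
    """
    return sum(p - q for p, q in zip(target_logps, behavior_logps))


def vtrace_targets(rewards, values, bootstrap_value, log_ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-Trace-style value targets for one trajectory of length T.

    rewards[t]      : reward received at step t
    values[t]       : critic estimate V(x_t)
    bootstrap_value : critic estimate V(x_T) for the state after the last step
    log_ratios[t]   : log pi(a_t | x_t) - log mu(a_t | x_t), where mu is the
                      behavior policy that generated the data
    """
    T = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * T
    acc = 0.0  # running value of v_{t+1} - V(x_{t+1})
    for t in reversed(range(T)):
        ratio = math.exp(log_ratios[t])
        rho = min(rho_bar, ratio)  # truncated weight on the TD error
        c = min(c_bar, ratio)      # truncated weight on the trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets
```

When the behavior and target policies coincide, every ratio equals 1 and the targets reduce to ordinary discounted n-step returns. When the target policy assigns almost no probability to the sampled actions, the weights vanish and the targets fall back to the critic's current estimates.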
