Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning

Abstract

Many advances in cooperative multi-agent reinforcement learning (MARL) are based on two common design principles: value decomposition and parameter sharing. A typical MARL algorithm of this kind decomposes a centralized Q-function into local Q-networks with parameters shared across agents. Such an algorithmic paradigm enables centralized training and decentralized execution (CTDE) and leads to efficient learning in practice. Despite these advantages, we revisit the two principles and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, value decomposition and parameter sharing can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases, which partially supports some recent empirical observations that PG can be effective in many MARL testbeds. Inspired by our theoretical analysis, we present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors, and we empirically validate our findings on a variety of domains, ranging from simplified matrix and grid-world games to complex benchmarks such as the StarCraft Multi-Agent Challenge and Google Research Football. We hope our insights will help the community develop more general and more powerful MARL algorithms. See our project website at https://sites.google.com/view/revisiting-marl.
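
To make the claim about multi-modal rewards concrete, the following is a minimal sketch under stated assumptions, not the paper's experimental setup. It uses a hypothetical two-agent, two-action XOR-style payoff matrix (reward 1 only when the agents choose different actions) and an additive, VDN-style decomposition fitted by least squares. The payoff values and the names `fit_additive`, `shared_best`, and `individual_best` are illustrative assumptions. The sketch also assumes that parameter-shared agents see identical observations with no agent ID, so a shared deterministic policy makes both agents pick the same action.

```python
import numpy as np

# Hypothetical XOR-style payoff (an assumption for illustration):
# the team earns 1 only when the two agents pick different actions.
payoff = np.array([[0.0, 1.0],
                   [1.0, 0.0]])


def fit_additive(payoff):
    """Least-squares fit of Q_tot(a1, a2) ~ q1[a1] + q2[a2]."""
    n1, n2 = payoff.shape
    rows, targets = [], []
    for a1 in range(n1):
        for a2 in range(n2):
            # One-hot features: one slot per local Q-value.
            feat = np.zeros(n1 + n2)
            feat[a1] = 1.0
            feat[n1 + a2] = 1.0
            rows.append(feat)
            targets.append(payoff[a1, a2])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    q1, q2 = theta[:n1], theta[n1:]
    # Reassemble the joint table implied by the two local Q-vectors.
    return q1[:, None] + q2[None, :]


approx = fit_additive(payoff)
max_error = np.abs(approx - payoff).max()

# Shared deterministic policy without agent IDs: both agents act identically,
# so only the diagonal of the payoff matrix is reachable.
shared_best = max(payoff[a, a] for a in range(payoff.shape[0]))
# Individual policies can reach any joint action.
individual_best = payoff.max()

print(approx)
print(max_error, shared_best, individual_best)
```

In this toy setting the best additive fit assigns the same value to every joint action, so the decomposed Q-function cannot tell the two optimal modes from the two failures, and the shared deterministic policy is confined to the zero-reward diagonal. Individual policies, in contrast, can represent either optimal joint action.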
