Negative Update Intervals in Deep Multi-Agent Reinforcement Learning

Abstract

In Multi-Agent Reinforcement Learning, independent cooperative learners must overcome a number of pathologies in order to learn optimal joint policies. These pathologies include action shadowing, stochasticity, the moving target problem and the alter-exploration problem (Matignon, Laurent, and Le Fort-Piat 2012; Wei and Luke 2016). Numerous methods have been proposed to address these pathologies, but evaluations are predominantly conducted in repeated strategic-form games and stochastic games consisting of only a small number of state transitions. This raises the question of whether the methods scale to complex, temporally extended, partially observable domains with stochastic transitions and rewards. In this paper we study such complex settings, which require reasoning over long time horizons and confront agents with the curse of dimensionality. To deal with the dimensionality, we adopt a Multi-Agent Deep Reinforcement Learning (MA-DRL) approach. We find that when the agents have to make critical decisions in seclusion, existing methods succumb to a combination of relative overgeneralisation (a type of action shadowing), the alter-exploration problem, and stochasticity. To address these pathologies we introduce expanding negative update intervals that enable independent learners to establish the near-optimal average utility values for higher-level strategies while largely discarding transitions from episodes that result in mis-coordination. We evaluate Negative Update Intervals Double-DQN (NUI-DDQN) within a temporally extended version of the Climb Game, a normal-form game which has frequently been used to study relative overgeneralisation and other pathologies. We show that NUI-DDQN can converge towards optimal joint policies in deterministic and stochastic reward settings, overcoming relative overgeneralisation and the alter-exploration problem while mitigating the moving target problem.
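To make relative overgeneralisation concrete, the sketch below computes the average utility of each action in the Climb Game when the teammate acts uniformly at random. The payoff matrix is an assumption: it is the deterministic version commonly attributed to Claus and Boutilier (1998) and is not stated in this abstract. The function names are hypothetical and this is an illustration of the pathology only, not the NUI-DDQN method.

```python
# Assumed payoff matrix for the deterministic Climb Game (commonly cited
# values; not given in the abstract). Rows: agent 1's action, columns:
# agent 2's action. Both agents receive the same reward.
CLIMB = [
    [11, -30, 0],   # agent 1 plays A
    [-30, 7, 6],    # agent 1 plays B
    [0, 0, 5],      # agent 1 plays C
]
ACTIONS = ["A", "B", "C"]


def average_utilities(payoffs):
    """Average reward per row action against a uniformly random teammate."""
    return [sum(row) / len(row) for row in payoffs]


def best_action_by_average(payoffs):
    """Action an average-based independent learner would be drawn towards."""
    averages = average_utilities(payoffs)
    return ACTIONS[averages.index(max(averages))]


def best_joint_action(payoffs):
    """Joint action with the highest team reward."""
    cells = [
        (payoffs[i][j], ACTIONS[i], ACTIONS[j])
        for i in range(len(payoffs))
        for j in range(len(payoffs[i]))
    ]
    _, a1, a2 = max(cells)
    return a1, a2


if __name__ == "__main__":
    for name, avg in zip(ACTIONS, average_utilities(CLIMB)):
        print(f"average utility of {name}: {avg:.2f}")
    print("preferred by average:", best_action_by_average(CLIMB))
    print("optimal joint action:", best_joint_action(CLIMB))
```

Under these assumed payoffs, the action with the highest average utility differs from the action that appears in the optimal joint action, because the penalties for mis-coordination drag down the averages of the other two actions. This is the gap that discarding transitions from mis-coordinated episodes is meant to close.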
