Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

Abstract

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas, including traffic control, distributed control, and smart grids. We assume that the reward function for each agent can be different and is observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only the policy parameters. In the second step, the agent propagates to its neighbors messages based on its value function and then updates its own value function. Hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate of 1/T with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, and nonlinear function approximation setting. We empirically demonstrate the effectiveness of our approach in experiments.
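
To make the neighbor-only communication pattern in the second step concrete, the following is a minimal sketch of generic consensus averaging on a ring network. It is not the paper's update rule, which the abstract does not spell out. The ring topology, the mixing weights, and the names `ring_mixing_matrix` and `propagate` are all assumptions chosen for illustration, and the per-agent vectors merely stand in for value-function parameters.

```python
import numpy as np

def ring_mixing_matrix(n_agents, self_weight=0.5):
    """Hypothetical doubly stochastic weights for a ring graph.

    Each agent keeps `self_weight` of its own vector and splits the
    remainder equally between its left and right neighbors.
    """
    W = np.zeros((n_agents, n_agents))
    side = (1.0 - self_weight) / 2.0
    for i in range(n_agents):
        W[i, i] += self_weight
        W[i, (i - 1) % n_agents] += side
        W[i, (i + 1) % n_agents] += side
    return W

def propagate(values, W, n_rounds):
    """Repeated neighbor averaging: row i only mixes rows j with W[i, j] > 0."""
    for _ in range(n_rounds):
        values = W @ values
    return values

# Illustrative stand-ins for each agent's local value-function parameters.
rng = np.random.default_rng(0)
n_agents, dim = 5, 3
local_values = rng.normal(size=(n_agents, dim))

W = ring_mixing_matrix(n_agents)
mixed = propagate(local_values, W, n_rounds=200)

# After enough rounds, every agent holds (approximately) the network average,
# even though no agent ever communicated beyond its two neighbors.
network_average = local_values.mean(axis=0)
```

Because the assumed weight matrix is doubly stochastic, each round preserves the network-wide average while shrinking disagreement between agents, which is why purely local exchanges are enough for the agents to agree.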
