Abstract
Existing distributed cooperative multi-agent reinforcement learning (MARL) frameworks usually assume undirected coordination graphs and communication graphs, and estimate a global reward via consensus algorithms for policy evaluation. Such frameworks may incur high communication costs and scale poorly because they require global consensus. In this work, we study MARL with directed coordination graphs and propose a distributed RL algorithm in which local policy evaluations are based on local value functions. The local value function of each agent is obtained by local communication with its neighbors through a directed learning-induced communication graph, without using any consensus algorithm. A zeroth-order optimization (ZOO) approach based on parameter perturbation is employed for gradient estimation. By comparing with existing ZOO-based RL algorithms, we show that our proposed distributed RL algorithm guarantees high scalability. A distributed resource allocation example is presented to illustrate the effectiveness of our algorithm.
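To make the idea of gradient estimation by parameter perturbation concrete, the following is a minimal, generic sketch of a two-point zeroth-order estimator with Gaussian perturbations. It is not the distributed algorithm proposed in this work: it is centralized, uses a single scalar objective rather than local value functions, and involves no communication graph. The names `zoo_gradient`, `reward`, `radius`, and `num_samples`, as well as the quadratic test objective, are assumptions introduced purely for illustration.

```python
import numpy as np


def zoo_gradient(f, theta, radius=1e-2, num_samples=2000, seed=0):
    """Estimate the gradient of f at theta using function values only.

    Generic two-point Gaussian-smoothing estimator (illustrative assumption,
    not the estimator analyzed in the paper).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        # Random perturbation direction applied to the parameters.
        u = rng.standard_normal(theta.shape)
        # Symmetric finite difference of the objective along u.
        delta = f(theta + radius * u) - f(theta - radius * u)
        grad += (delta / (2.0 * radius)) * u
    return grad / num_samples


# Hypothetical objective standing in for a value function: maximized at `target`.
target = np.array([1.0, -2.0, 3.0])


def reward(x):
    return -np.sum((x - target) ** 2)


theta = np.zeros(3)
estimate = zoo_gradient(reward, theta)

# One gradient-ascent step on the parameters using the estimate.
theta_next = theta + 0.1 * estimate
```

The estimator queries the objective only at perturbed parameter values, which is what allows a ZOO method to be used when an analytic policy gradient is unavailable. Averaging over more samples reduces the variance of the estimate at the cost of additional evaluations.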