Abstract
In cooperative Multi-Agent Reinforcement Learning (MARL) under the Centralized Training with Decentralized Execution (CTDE) setting, agents observe and interact with their environment locally and independently. Because observations are local and actions are sampled randomly, the randomness in rewards and observations leads to randomness in long-term returns. Existing methods such as Value Decomposition Network (VDN) and QMIX estimate the long-term return as a scalar, which carries no information about this randomness. Our proposed model, QR-MIX, introduces quantile regression and models the joint state-action value as a distribution by combining QMIX with the Implicit Quantile Network (IQN). However, the monotonicity constraint in QMIX limits the expressiveness of the joint state-action value distribution and may lead to incorrect estimates in non-monotonic cases. We therefore propose a flexible loss function that approximates the monotonicity found in QMIX rather than enforcing it strictly. As a result, our model is more tolerant not only of the randomness of returns but also of the randomness of monotonic constraints. Experimental results demonstrate that QR-MIX outperforms the previous state-of-the-art method, QMIX, in the StarCraft Multi-Agent Challenge (SMAC) environment.
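To illustrate why a scalar value estimate hides the randomness of returns, the following is a minimal sketch of the standard quantile-regression (pinball) loss that quantile-based methods build on. It is not the QR-MIX loss itself: the sample returns, the candidate grid, and the function names `pinball_loss` and `best_estimate` are assumptions made purely for illustration.

```python
def pinball_loss(samples, estimate, tau):
    """Average quantile-regression (pinball) loss of `estimate` at level `tau`."""
    total = 0.0
    for x in samples:
        u = x - estimate  # residual between a sampled return and the estimate
        # Under-estimates (u >= 0) are weighted by tau,
        # over-estimates (u < 0) by (1 - tau).
        total += u * (tau - (1.0 if u < 0 else 0.0))
    return total / len(samples)


def best_estimate(samples, tau, candidates):
    """Return the candidate value with the lowest pinball loss (grid search)."""
    return min(candidates, key=lambda c: pinball_loss(samples, c, tau))


# Hypothetical returns: mostly small rewards with one rare large payoff.
returns = [0.0, 1.0, 1.0, 2.0, 10.0]
candidates = [0.0, 1.0, 2.0, 10.0]

# A scalar (mean) estimate collapses the distribution to one number.
mean_return = sum(returns) / len(returns)

# Minimizing the pinball loss at several levels recovers several quantiles,
# which together describe the spread of the returns.
quantile_estimates = {
    tau: best_estimate(returns, tau, candidates) for tau in (0.1, 0.5, 0.9)
}

print(mean_return)
print(quantile_estimates)
```

In this toy setting the mean sits between the typical returns and the rare large one, while the estimates at different quantile levels separate the two. That extra information is what a distributional estimate keeps and a scalar estimate discards.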