RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents

Abstract

Current value-based multi-agent reinforcement learning (MARL) methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q values are not sufficient even with CTDE because of the randomness of rewards and the uncertainty in environments, which causes these methods to fail to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method that applies the Conditional Value at Risk (CVaR) measure to the learned distributions of individuals' Q values. Specifically, we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution. Then, to handle the temporal nature of the stochastic outcomes during execution, we propose a dynamic risk level predictor for risk level tuning. Finally, we optimize the CVaR policies during centralized training: the CVaR values are used to estimate the target in the TD error, and they also serve as auxiliary local rewards to update the local distributions via the Quantile Regression loss. Empirically, we show that our method significantly outperforms state-of-the-art methods on challenging StarCraft II tasks, demonstrating enhanced coordination and improved sample efficiency.
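To make the CVaR step concrete, the following is a minimal sketch and not the paper's implementation. It assumes each agent's return distribution for an action is represented by N equally weighted quantile estimates, in which case CVaR at risk level alpha can be approximated by averaging the lowest floor(alpha * N) estimates. The function names `cvar_from_quantiles` and `greedy_action` and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def cvar_from_quantiles(quantiles, alpha):
    """Approximate CVaR_alpha from equally weighted quantile estimates.

    Assumption: `quantiles` holds N samples of the return distribution,
    each carrying probability mass 1/N. CVaR_alpha is then approximated
    by the mean of the lowest floor(alpha * N) samples (at least one).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    q = np.sort(np.asarray(quantiles, dtype=float))
    k = max(1, int(np.floor(alpha * q.size)))
    return float(q[:k].mean())


def greedy_action(quantiles_per_action, alpha):
    """Pick the action whose return distribution has the largest CVaR_alpha."""
    values = [cvar_from_quantiles(q, alpha) for q in quantiles_per_action]
    return int(np.argmax(values))


# Toy example: action 0 has the higher mean but a heavy lower tail,
# action 1 has a lower mean but no bad outcomes.
toy_quantiles = [
    [-10.0, 5.0, 5.0, 20.0],  # action 0, mean 5.0
    [1.0, 2.0, 3.0, 4.0],     # action 1, mean 2.5
]
risk_neutral_choice = greedy_action(toy_quantiles, alpha=1.0)
risk_averse_choice = greedy_action(toy_quantiles, alpha=0.25)
```

With alpha = 1 the measure reduces to the ordinary mean, so the risk-neutral choice is action 0; with alpha = 0.25 only the worst outcome counts, so the risk-averse choice is action 1. In the method described above, the risk level is produced by a learned dynamic predictor rather than fixed by hand as it is here.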
