Abstract
Value function factorization methods are commonly used in cooperative multi-agent reinforcement learning, with QMIX receiving significant attention. Many QMIX-based methods introduce monotonicity constraints between the joint action value and individual action values to achieve decentralized execution. However, such constraints limit the representation capacity of value factorization, restricting the joint action values it can represent and hindering the learning of the optimal policy. To address this challenge, we propose the Potentially Optimal Joint Actions Weighted QMIX (POWQMIX) algorithm, which recognizes the potentially optimal joint actions and assigns higher weights to the corresponding losses of these joint actions during training. We theoretically prove that with such a weighted training approach, the optimal policy is guaranteed to be recovered. Experiments in matrix games, difficulty-enhanced predator-prey, and StarCraft II Multi-Agent Challenge environments demonstrate that our algorithm outperforms the state-of-the-art value-based multi-agent reinforcement learning methods.
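
As an illustration of the weighted training idea only, the sketch below computes a weighted squared TD error in which joint actions flagged as potentially optimal keep full weight and all others are down-weighted. The function name `weighted_td_loss`, the constant `alpha`, and the two-level weighting scheme are assumptions made for this example; the abstract does not specify the weight values or how potentially optimal joint actions are recognized, so the flags are simply taken as an input here.

```python
def weighted_td_loss(q_tot, targets, potentially_optimal, alpha=0.1):
    """Mean weighted squared TD error over a batch (illustrative sketch).

    q_tot               -- joint action values for the sampled joint actions
    targets             -- TD targets for the same samples
    potentially_optimal -- booleans, True where the sampled joint action is
                           flagged as potentially optimal (the recognition
                           step itself is not modeled here)
    alpha               -- hypothetical weight in (0, 1] for all other samples
    """
    if not (len(q_tot) == len(targets) == len(potentially_optimal)):
        raise ValueError("all inputs must have the same length")
    if not q_tot:
        raise ValueError("batch must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")

    total = 0.0
    for q, y, flagged in zip(q_tot, targets, potentially_optimal):
        # Assumed scheme: full weight for flagged joint actions, alpha otherwise.
        weight = 1.0 if flagged else alpha
        total += weight * (q - y) ** 2
    return total / len(q_tot)


# Usage: the second sample is not flagged, so its error counts half as much.
loss = weighted_td_loss([1.0, 2.0], [0.0, 0.0], [True, False], alpha=0.5)
```

With `alpha = 1.0` the function reduces to the ordinary unweighted mean squared TD error, which makes the effect of the weighting easy to isolate.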