Curriculum Learning With Counterfactual Group Relative Policy Advantage For Multi-Agent Reinforcement Learning

Abstract

Multi-agent reinforcement learning (MARL) has achieved strong performance in cooperative adversarial tasks. However, most existing methods train agents against fixed opponent strategies and rely on such meta-static difficulty conditions, which limits their adaptability to changing environments and often leads to suboptimal policies. Inspired by the success of curriculum learning (CL) in supervised tasks, we propose a dynamic CL framework for MARL that employs a self-adaptive difficulty adjustment mechanism. This mechanism continuously modulates opponent strength based on real-time agent training performance, allowing agents to progress from easier to more challenging scenarios. However, the dynamic nature of CL introduces instability due to non-stationary environments and sparse global rewards. To address this challenge, we develop a Counterfactual Group Relative Policy Advantage (CGRPA), which is tightly coupled with the curriculum by providing intrinsic credit signals that reflect each agent's impact under evolving task demands. CGRPA constructs a counterfactual action advantage function that isolates each agent's contribution within group behavior, providing intrinsic rewards that enhance credit assignment, stabilize learning under non-stationary conditions, and facilitate more reliable policy updates throughout the curriculum. Extensive experiments demonstrate that our method improves both training stability and final performance, achieving competitive results against state-of-the-art methods. The code is available at https://github.com/NICE-HKU/CL2MARL-SMAC.
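
The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' method: `adjust_difficulty` is a hypothetical threshold rule on win rate (the thresholds and level range are invented for illustration), and `counterfactual_advantage` uses the familiar COMA-style baseline, which may differ from how CGRPA is actually defined.

```python
from typing import Sequence


def adjust_difficulty(level: int, win_rate: float, *, up: float = 0.7,
                      down: float = 0.3, min_level: int = 0,
                      max_level: int = 9) -> int:
    """Hypothetical curriculum rule (not from the paper).

    Raise opponent difficulty when the recent win rate is high,
    lower it when the win rate is low, otherwise keep it unchanged.
    """
    if win_rate >= up:
        return min(level + 1, max_level)
    if win_rate <= down:
        return max(level - 1, min_level)
    return level


def counterfactual_advantage(q_values: Sequence[float],
                             policy: Sequence[float],
                             action: int) -> float:
    """COMA-style counterfactual advantage for a single agent (assumed form).

    q_values[a] is the joint-action value when this agent takes action a
    while the other agents' actions stay fixed; policy[a] is this agent's
    probability of choosing a. The baseline is the policy-weighted mean
    value over this agent's alternative actions.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[action] - baseline
```

In this sketch a positive advantage means the chosen action did better than the agent's average alternative, which is the kind of per-agent credit signal the abstract describes.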
