Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation

Abstract

In Multi-agent Reinforcement Learning (MARL), accurately perceiving opponents' strategies is essential for both cooperative and adversarial contexts, particularly within dynamic environments. While Proximal Policy Optimization (PPO) and related algorithms such as Actor-Critic with Experience Replay (ACER), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG) perform well in single-agent, stationary environments, they suffer from high variance in MARL due to non-stationary and hidden policies of opponents, leading to diminished reward performance. Additionally, existing methods in MARL face significant challenges, including the need for inter-agent communication, reliance on explicit reward information, high computational demands, and sampling inefficiencies. These issues render them less effective in continuous environments where opponents may abruptly change their policies without prior notice. Against this background, we present OPS-DeMo (Online Policy Switch-Detection Model), an online algorithm that employs dynamic error decay to detect changes in opponents' policies. OPS-DeMo continuously updates its beliefs using an Assumed Opponent Policy (AOP) Bank and selects corresponding responses from a pre-trained Response Policy Bank. Each response policy is trained against consistently strategizing opponents, reducing training uncertainty and enabling the effective use of algorithms like PPO in multi-agent environments. Comparative assessments show that our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting, providing greater robustness to sudden policy shifts and enabling more informed decision-making through precise opponent policy insights.
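The abstract does not spell out the update rule, so the following is only a minimal sketch of how a running error with decay could drive switch detection over an AOP Bank. The class name `SwitchDetector`, the "surprise" measure (one minus the probability the assumed policy gives the observed action), the `decay` and `threshold` values, and the two toy policies are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable

# Hypothetical type: an assumed opponent policy returns the probability
# of `action` in `state`.
PolicyFn = Callable[[Hashable, Hashable], float]


@dataclass
class SwitchDetector:
    """Illustrative running-error tracker over a bank of assumed opponent policies."""

    aop_bank: Dict[str, PolicyFn]
    decay: float = 0.9       # assumed per-step decay applied to old error
    threshold: float = 2.0   # assumed error level that triggers a re-selection
    current: str = ""
    errors: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.current:
            # Start by believing the first policy in the bank.
            self.current = next(iter(self.aop_bank))
        self.errors = {name: 0.0 for name in self.aop_bank}

    def observe(self, state: Hashable, action: Hashable) -> str:
        """Update every running error, then switch belief if the current one is too high."""
        for name, policy in self.aop_bank.items():
            surprise = 1.0 - policy(state, action)
            self.errors[name] = self.decay * self.errors[name] + surprise
        if self.errors[self.current] > self.threshold:
            # Adopt whichever assumed policy currently explains the opponent best.
            self.current = min(self.errors, key=self.errors.get)
        return self.current


# Toy bank: each assumed policy strongly prefers one action in every state.
def chase(state: Hashable, action: Hashable) -> float:
    return 0.9 if action == "toward" else 0.1


def flee(state: Hashable, action: Hashable) -> float:
    return 0.9 if action == "away" else 0.1


detector = SwitchDetector(aop_bank={"chase": chase, "flee": flee})
# Hypothetical Response Policy Bank: one pre-trained response per assumed policy.
response_bank = {"chase": "evade_response", "flee": "pursue_response"}

for _ in range(5):
    belief = detector.observe(state=None, action="toward")
for _ in range(10):
    belief = detector.observe(state=None, action="away")
selected_response = response_bank[belief]
```

Because old error decays geometrically, occasional mismatches under a stochastic opponent policy stay bounded below the threshold, while a sustained run of mismatches pushes the error over it. In this sketch the belief does not flip on the first surprising action; it flips once the accumulated error of the current assumption both exceeds the threshold and is no longer the smallest in the bank.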
