Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning

Abstract

We examine online safe multi-agent reinforcement learning using constrainedMarkov games in which agents compete by maximizing their expected total rewardsunder a constraint on expected total utilities. Our focus is confined to anepisodic two-player zero-sum constrained Markov game with independenttransition functions that are unknown to agents, adversarial reward functions,and stochastic utility functions. For such a Markov game, we employ an approachbased on the occupancy measure to formulate it as an online constrainedsaddle-point problem with an explicit constraint. We extend the Lagrangemultiplier method in constrained optimization to handle the constraint bycreating a generalized Lagrangian with minimax decision primal variables and adual variable. Next, we develop an upper confidence reinforcement learningalgorithm to solve this Lagrangian problem while balancing exploration andexploitation. Our algorithm updates the minimax decision primal variables viaonline mirror descent and the dual variable via projected gradient step and weprove that it enjoys sublinear rate $ O((|X|+|Y|) L \sqrt{T(|A|+|B|)}))$ forboth regret and constraint violation after playing $T$ episodes of the game.Here, $L$ is the horizon of each episode, $(|X|,|A|)$ and $(|Y|,|B|)$ are thestate/action space sizes of the min-player and the max-player, respectively. Tothe best of our knowledge, we provide the first provably efficient online safereinforcement learning algorithm in constrained Markov games.

Quick Read (beta)

loading the full paper ...