Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

Abstract

We consider online learning for episodic stochastically constrained Markovdecision processes (CMDPs), which plays a central role in ensuring the safetyof reinforcement learning. Here the loss function can vary arbitrarily acrossthe episodes, and both the loss received and the budget consumption arerevealed at the end of each episode. Previous works solve this problem underthe restrictive assumption that the transition model of the Markov decisionprocesses (MDPs) is known a priori and establish regret bounds that dependpolynomially on the cardinalities of the state space $\mathcal{S}$ and theaction space $\mathcal{A}$. In this work, we propose a new \emph{upperconfidence primal-dual} algorithm, which only requires the trajectories sampledfrom the transition model. In particular, we prove that the proposed algorithmachieves $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upperbounds of both the regret and the constraint violation, where $L$ is the lengthof each episode. Our analysis incorporates a new high-probability driftanalysis of Lagrange multiplier processes into the celebrated regret analysisof upper confidence reinforcement learning, which demonstrates the power of"optimism in the face of uncertainty" in constrained online learning.

Quick Read (beta)

loading the full paper ...