Abstract
In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We discuss robust-constrained offline optimisation algorithms as well as how to incorporate uncertainty in transition dynamics of unknown states based on empirical inference and prior knowledge.