Abstract
Risk-averse Constrained Reinforcement Learning (RaCRL) aims to learn policies that minimise the likelihood of rare and catastrophic constraint violations caused by an environment's inherent randomness. In general, risk-aversion leads to conservative exploration of the environment, which typically results in convergence to sub-optimal policies that fail to adequately maximise reward or, in some cases, fail to achieve the goal. In this paper, we propose an exploration-based approach for RaCRL called Optimistic Risk-averse Actor Critic (ORAC), which constructs an exploratory policy by maximising a local upper confidence bound of the state-action reward value function whilst minimising a local lower confidence bound of the risk-averse state-action cost value function. Specifically, at each step, the weighting assigned to the cost value is increased if the cost value exceeds the safety constraint value and decreased if it falls below it. This way, the policy is encouraged to explore uncertain regions of the environment to discover high-reward states whilst still satisfying the safety constraints. Our experimental results demonstrate that the ORAC approach prevents convergence to sub-optimal policies and significantly improves the reward-cost trade-off in various continuous control tasks, including Safety-Gymnasium and CityLearn, a complex building energy management environment.
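
The exploration rule described above can be illustrated with a minimal sketch. It is not the ORAC implementation: it assumes a small set of candidate actions, an ensemble of value estimates per action, confidence bounds formed as mean plus or minus `beta` standard deviations, and a simple additive weight update. The names `exploratory_action`, `update_weight`, `beta`, and `step_size` are hypothetical, and the cost estimates are assumed to already be risk-averse values.

```python
from statistics import mean, pstdev


def upper_bound(estimates, beta):
    # Optimistic estimate: ensemble mean plus beta standard deviations (assumed form).
    return mean(estimates) + beta * pstdev(estimates)


def lower_bound(estimates, beta):
    # Optimistic-for-cost estimate: ensemble mean minus beta standard deviations.
    return mean(estimates) - beta * pstdev(estimates)


def exploratory_action(reward_q, cost_q, weight, beta=1.0):
    """Pick the action maximising UCB(reward) - weight * LCB(cost).

    reward_q and cost_q map each candidate action to a list of
    ensemble estimates (a hypothetical stand-in for learned critics).
    """
    def score(action):
        return (upper_bound(reward_q[action], beta)
                - weight * lower_bound(cost_q[action], beta))

    return max(reward_q, key=score)


def update_weight(weight, cost_value, cost_limit, step_size=0.1):
    # Raise the weight when the cost value exceeds the limit, lower it
    # otherwise, and keep it non-negative.
    return max(0.0, weight + step_size * (cost_value - cost_limit))


# Action "b" has the same mean reward as "a" but higher uncertainty,
# so the optimistic score prefers it.
reward_q = {"a": [1.0, 1.0], "b": [0.0, 2.0]}
cost_q = {"a": [0.5, 0.5], "b": [0.0, 1.0]}
chosen = exploratory_action(reward_q, cost_q, weight=1.0, beta=1.0)
new_weight = update_weight(1.0, cost_value=2.0, cost_limit=1.0)
```

In this toy setting the uncertain action is selected because both of its bounds are optimistic, while the weight grows whenever the estimated cost is above the limit, which progressively discourages unsafe exploration.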