Revisiting Safe Exploration in Safe Reinforcement Learning

Abstract

Safe reinforcement learning (SafeRL) extends standard reinforcement learning with the idea of safety, where safety is typically defined by constraining the expected cost return of a trajectory to stay below a set limit. However, this metric fails to distinguish how costs accrue, treating infrequent severe cost events as equal to frequent mild ones, which can lead to riskier behaviors and result in unsafe exploration. We introduce a new metric, expected maximum consecutive cost steps (EMCC), which addresses safety during training by assessing the severity of unsafe steps based on their consecutive occurrence. This metric is particularly effective for distinguishing between prolonged and occasional safety violations. We apply EMCC to both on- and off-policy algorithms to benchmark their safe exploration capability. Finally, we validate our metric through a set of benchmarks and propose a new lightweight benchmark task, which allows fast evaluation for algorithm design.
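
To make the idea behind EMCC concrete, the following is a minimal sketch rather than the paper's implementation. The abstract does not give the exact definition, so the sketch assumes that a step counts as unsafe when its cost is strictly positive, and that the expectation is estimated by averaging over sampled trajectories. The function names `max_consecutive_cost_steps` and `estimate_emcc` are hypothetical.

```python
from typing import Sequence


def max_consecutive_cost_steps(costs: Sequence[float]) -> int:
    """Length of the longest run of consecutive unsafe steps in one trajectory.

    Assumption: a step is unsafe when its cost is strictly positive.
    """
    longest = 0
    current = 0
    for cost in costs:
        if cost > 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a safe step breaks the run
    return longest


def estimate_emcc(trajectories: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate: mean of the per-trajectory longest unsafe run."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    total = sum(max_consecutive_cost_steps(t) for t in trajectories)
    return total / len(trajectories)


# Two trajectories with the same cost return (3) but different cost patterns.
prolonged = [1, 1, 1, 0, 0, 0]   # one sustained violation
occasional = [1, 0, 1, 0, 1, 0]  # isolated violations

print(sum(prolonged), max_consecutive_cost_steps(prolonged))
print(sum(occasional), max_consecutive_cost_steps(occasional))
```

Under these assumptions, both trajectories have an identical cost return, so a cost-return constraint cannot tell them apart, while the longest-run statistic separates the prolonged violation from the occasional ones.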
