Abstract
State-of-the-art reinforcement learning methods sometimes encounter unsafe situations. Identifying when these situations occur is of interest both for post-hoc analysis and during deployment, where it might be advantageous to call out to a human overseer for help. Methods to gauge the criticality of different points in time have been developed, but their accuracy is not well established due to a lack of ground truth, and they are not designed to be easily interpretable by end users. Therefore, we seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality. Safety margins make these metrics interpretable when defined as the number of random actions for which performance loss will not exceed some tolerance with high confidence. We demonstrate this approach in several environment-agent combinations; for an A3C agent in an Atari Beamrider environment, the lowest 5% of safety margins contain 47% of agent losses, i.e., supervising only 5% of decisions could potentially prevent roughly half of an agent's errors. This criticality framework measures the potential impacts of bad decisions, even before those decisions are made, allowing for more effective debugging and oversight of autonomous agents.
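The two definitions above (true criticality and safety margin) can be illustrated with a minimal Monte Carlo sketch. Everything in it is an assumption made for illustration rather than the setup used in this work: the one-dimensional "pit" environment, the fixed horizon, the always-move-right stand-in policy, and the helper names `step`, `policy`, `rollout`, `losses`, `true_criticality`, and `safety_margin` are all hypothetical. Proxy criticality is not modelled here.

```python
import random

HORIZON = 10  # assumed fixed episode length for this toy example


def step(pos, action):
    """Hypothetical toy dynamics: action 1 moves right, action 0 moves left.
    Reaching position 0 (the pit) ends the episode with no further reward."""
    new_pos = pos + (1 if action == 1 else -1)
    fell = new_pos <= 0
    reward = 0.0 if fell else 1.0
    return new_pos, reward, fell


def policy(pos):
    """Stand-in for a trained agent: always move away from the pit."""
    return 1


def rollout(pos, n_random, rng):
    """Return over HORIZON steps when the first n_random actions are uniformly
    random and the remaining actions follow the policy."""
    total = 0.0
    for t in range(HORIZON):
        action = rng.randrange(2) if t < n_random else policy(pos)
        pos, reward, done = step(pos, action)
        total += reward
        if done:
            break
    return total


def losses(pos, n_random, rng, samples=500):
    """Sampled reward drops relative to following the policy throughout."""
    baseline = rollout(pos, 0, rng)  # deterministic in this toy setting
    return [baseline - rollout(pos, n_random, rng) for _ in range(samples)]


def true_criticality(pos, n_random, rng, samples=500):
    """Monte Carlo estimate of the expected reward drop after n random actions."""
    sampled = losses(pos, n_random, rng, samples)
    return sum(sampled) / len(sampled)


def safety_margin(pos, tolerance, confidence, rng, max_n=8, samples=500):
    """Largest n (up to max_n) for which the sampled loss stays within
    `tolerance` in at least a `confidence` fraction of rollouts."""
    margin = 0
    for n in range(1, max_n + 1):
        sampled = losses(pos, n, rng, samples)
        within = sum(loss <= tolerance for loss in sampled) / len(sampled)
        if within < confidence:
            break
        margin = n
    return margin


if __name__ == "__main__":
    rng = random.Random(0)
    for start in (1, 3, 5):
        print(start,
              true_criticality(start, 2, rng),
              safety_margin(start, tolerance=0.0, confidence=0.95, rng=rng))
```

In this toy setting, a state adjacent to the pit has a large criticality even for a single random action, whereas a state five steps away has zero criticality for n = 2 because two random moves cannot reach the pit. The safety margin summarizes the same information as a count of tolerable random actions, which is the quantity an overseer would threshold on.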