Towards Behavior-Level Explanation for Deep Reinforcement Learning

Abstract

While Deep Neural Networks (DNNs) are becoming the state of the art for many tasks, including reinforcement learning (RL), they are especially resistant to human scrutiny and understanding. Input attributions have been a foundational building block for DNN explainability but face new challenges when applied to deep RL. We address these challenges with two novel techniques. We define a class of \emph{behaviour-level attributions} for explaining agent behaviour beyond input importance and interpret existing attribution methods on the behaviour level. We then introduce \emph{$\lambda$-alignment}, a metric for evaluating the performance of behaviour-level attribution methods in terms of whether they are indicative of the agent actions they are meant to explain. Our experiments on Atari games suggest that, from the perspective of this metric, perturbation-based attribution methods are significantly better suited to deep RL than the alternatives. We argue that our methods demonstrate the minimal set of considerations for adapting general DNN explanation technology to the unique aspects of reinforcement learning, and we hope the outlined direction can serve as a basis for future research on understanding deep RL using attribution.
