A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

Abstract

Reinforcement learning agents can achieve super-human performance in complex decision-making tasks, but their behaviour is often difficult to understand and explain. This lack of explanation limits deployment, especially in safety-critical settings where understanding and trust are essential. We identify three core explanatory targets that together provide a comprehensive view of reinforcement learning agents: behaviour, outcomes, and predictions. We develop a unified theoretical framework for explaining these three elements of reinforcement learning agents through the influence of individual features that the agent observes in its environment. We derive feature influences by using Shapley values, which collectively and uniquely satisfy a set of well-motivated axioms for fair and consistent credit assignment. The proposed approach, Shapley Values for Explaining Reinforcement Learning (SVERL), provides a single theoretical framework to comprehensively and meaningfully explain reinforcement learning agents. It yields explanations with precise semantics that are not only interpretable but also mathematically justified, enabling us to identify and correct conceptual issues in prior explanations. Through illustrative examples, we show how SVERL produces useful, intuitive explanations of agent behaviour, outcomes, and predictions, which are not apparent from observing agent behaviour alone.
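
For readers unfamiliar with Shapley values, the generic credit-assignment rule the abstract refers to can be illustrated with a minimal sketch. The abstract does not specify how SVERL assigns a value to a subset of observed features, so the function `toy_value` below is a purely hypothetical stand-in, and the names `shapley_values` and `toy_value` are assumptions made for illustration rather than part of the paper. The sketch computes exact Shapley values by averaging each feature's marginal contribution over all subsets of the remaining features.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value):
    """Exact Shapley values for a set function value(frozenset) -> float.

    Enumerates every subset, so it is only practical for a handful of features.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(known):
    """Hypothetical value of knowing a subset of three features (not from the paper).

    Feature 0 is worth 1.0 on its own; features 1 and 2 are worth 2.0,
    but only when both are known together.
    """
    total = 0.0
    if 0 in known:
        total += 1.0
    if 1 in known and 2 in known:
        total += 2.0
    return total


phi = shapley_values(3, toy_value)
```

In this toy setting the contributions sum to the difference between the value of knowing all features and the value of knowing none (the efficiency axiom), and features 1 and 2, which play interchangeable roles, receive equal credit (the symmetry axiom).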
