A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

Abstract

Reinforcement learning agents can achieve superhuman performance, but their decisions are often difficult to interpret. This lack of transparency limits deployment, especially in safety-critical settings where human trust and accountability are essential. In this work, we develop a theoretical framework for explaining reinforcement learning through the influence of state features, which represent what the agent observes in its environment. We identify three core elements of the agent-environment interaction that benefit from explanation: behaviour (what the agent does), performance (what the agent achieves), and value estimation (what the agent expects to achieve). We treat state features as players cooperating to produce each element and apply Shapley values, a principled method from cooperative game theory, to identify the influence of each feature. This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees. We use illustrative examples to show how these explanations align with human intuition and reveal novel insights. Our framework unifies and extends prior work, making explicit the assumptions behind existing approaches, and offers a principled foundation for more interpretable and trustworthy reinforcement learning.
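
To make the idea of treating state features as cooperating players concrete, the following is a minimal sketch of an exact Shapley value computation over feature subsets. It is an illustration only, not the paper's method: the names `shapley_values` and `toy_value` are hypothetical, and the characteristic function here is an invented toy game standing in for whichever element (behaviour, performance, or value estimation) is being explained. The definitions the paper actually uses for those games are not part of this abstract.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value_fn):
    """Exact Shapley values for a game whose players are feature indices.

    value_fn maps a frozenset of known feature indices to a real number.
    The cost grows exponentially in n_features, so this only suits toy cases.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(known):
    """Hypothetical characteristic function (an assumption, not from the paper).

    Feature 0 is worth 1 on its own, features 0 and 1 together add 2 more,
    and feature 2 is irrelevant.
    """
    v = 0.0
    if 0 in known:
        v += 1.0
    if 0 in known and 1 in known:
        v += 2.0
    return v


phi = shapley_values(3, toy_value)
print(phi)
```

In this toy game the attributions sum to the difference between the value with all features known and the value with none known, which is the efficiency property of Shapley values, and the irrelevant feature receives zero.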
