Abstract
Interpretability in reinforcement learning is crucial for ensuring AI systems align with human values and fulfill the diverse related requirements including safety, robustness and fairness. Building on recent approaches to encouraging sparsity and locality in neural networks, we demonstrate how the penalisation of non-local weights leads to the emergence of functionally independent modules in the policy network of a reinforcement learning agent. To illustrate this, we demonstrate the emergence of two parallel modules for assessment of movement along the X and Y axes in a stochastic Minigrid environment. Through the novel application of community detection algorithms, we show how these modules can be automatically identified and their functional roles verified through direct intervention on the network weights prior to inference. This establishes a scalable framework for reinforcement learning interpretability through functional modularity, addressing challenges regarding the trade-off between completeness and cognitive tractability of reinforcement learning explanations.