Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning

Abstract

Deep reinforcement learning (DRL), through learning policies or values represented by neural networks, has successfully addressed many complex control problems. However, the neural networks introduced by DRL lack interpretability and transparency. Current DRL interpretability methods largely treat neural networks as black boxes, with few approaches delving into the internal mechanisms of policy/value networks. This limitation undermines trust in both the neural network models that represent policies and the explanations derived from them. In this work, we propose a novel concept-based interpretability method that provides fine-grained explanations of DRL models at the neuron level. Our method formalizes atomic concepts as binary functions over the state space and constructs complex concepts through logical operations. By analyzing the correspondence between neuron activations and concept functions, we establish interpretable explanations for individual neurons in policy/value networks. Experimental results on both continuous control tasks and discrete decision-making environments demonstrate that our method can effectively identify meaningful concepts that align with human understanding while faithfully reflecting the network's decision-making logic.
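
To make the idea of binary concept functions and their logical composition concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure from the paper. The abstract does not specify how neuron activations are compared with concepts, so the quantile threshold, the intersection-over-union (IoU) score, the one-step composition search, and every name used (`best_concept`, `pos>0`, `vel>0`, the toy states and activations) are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks (assumed matching score)."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def best_concept(activations, atomic, quantile=0.5):
    """Return (name, score) of the candidate concept closest to one neuron."""
    # Assumption: the neuron counts as "on" above its own activation quantile.
    neuron_on = activations > np.quantile(activations, quantile)
    names = list(atomic)
    candidates = dict(atomic)
    # Compose atomic concepts with NOT, AND, OR (one composition step only).
    for n in names:
        candidates[f"NOT {n}"] = ~atomic[n]
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            candidates[f"{p} AND {q}"] = atomic[p] & atomic[q]
            candidates[f"{p} OR {q}"] = atomic[p] | atomic[q]
    scored = ((name, iou(neuron_on, mask)) for name, mask in candidates.items())
    return max(scored, key=lambda item: item[1])

# Toy batch of states with columns (position, velocity); values are made up.
states = np.array([[-1.0, 0.5], [0.5, 0.5], [1.0, -0.5],
                   [2.0, 1.0], [-0.5, -1.0], [1.5, 2.0]])
# Atomic concepts: binary functions of the state, evaluated over the batch.
atomic = {"pos>0": states[:, 0] > 0, "vel>0": states[:, 1] > 0}
# Hypothetical activations of one neuron on the same six states.
activations = np.array([0.0, 0.9, 0.1, 0.8, 0.0, 0.7])

name, score = best_concept(activations, atomic)
print(name, score)
```

In this toy batch the neuron is active only on states where both position and velocity are positive, so the composed concept describes it better than either atomic concept alone.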
