The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited

Abstract

Interpretability of deep reinforcement learning systems could assist operators with understanding how these systems interact with their environment. Vector quantization methods, also called codebook methods, discretize a neural network's latent space, which is often suggested to yield emergent interpretability. We investigate whether vector quantization in fact provides interpretability in model-based reinforcement learning. Our experiments, conducted in the reinforcement learning environment Crafter, show that the codes of vector quantization models are inconsistent, have no guarantee of uniqueness, and have a limited impact on concept disentanglement, all of which are necessary traits for interpretability. We share insights on why vector quantization may be fundamentally insufficient for model interpretability.
