Abstract
Interpretability of deep reinforcement learning systems could assist operators with understanding how they interact with their environment. Vector quantization methods, also called codebook methods, discretize a neural network's latent space, and this discretization is often suggested to yield emergent interpretability. We investigate whether vector quantization in fact provides interpretability in model-based reinforcement learning. Our experiments, conducted in the reinforcement learning environment Crafter, show that the codes of vector quantization models are inconsistent, have no guarantee of uniqueness, and have a limited impact on concept disentanglement, whereas consistency, uniqueness, and disentanglement are all necessary traits for interpretability. We share insights on why vector quantization may be fundamentally insufficient for model interpretability.