Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models

Abstract

Semantic interpretability in Reinforcement Learning (RL) supports transparency and verifiability by making an agent's decisions understandable to humans. Achieving this, however, requires a feature space composed of human-understandable concepts, which traditionally rely on human specification and may fail to generalize to unseen environments. We introduce interpretable Tree-based Reinforcement learning via Automated Concept Extraction (iTRACE), an automated framework that leverages pre-trained vision-language models (VLMs) for semantic feature extraction and interpretable tree-based models for policy optimization. iTRACE first extracts semantically meaningful features, then maps them to policies via interpretable trees. To address the impracticality of running VLMs in RL loops, we distill their outputs into a lightweight model. By leveraging VLMs to automate tree-based reinforcement learning, iTRACE eliminates the need for human annotation traditionally required by interpretable models, while also addressing the limitations of VLMs alone, such as their lack of grounding in action spaces and inability to directly optimize policies. iTRACE outperforms MLP baselines that use the same interpretable features and matches the performance of CNN-based policies, producing verifiable, semantically interpretable, and human-aligned behaviors.
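
As a rough illustration of the two-stage idea described above (semantic features first, then a tree that maps them to actions), the following minimal Python sketch may help. It is not the iTRACE implementation: the feature name `ball_minus_paddle_x`, the function `extract_features` (a stand-in for the distilled lightweight extractor), the actions `LEFT`/`RIGHT`, and the hand-written tree `POLICY` are all hypothetical, and the tree here is fixed by hand rather than optimized with RL.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Features = Dict[str, float]


@dataclass
class Node:
    """One node of a small binary decision tree over named features."""
    feature: Optional[str] = None    # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    action: Optional[str] = None     # set only on leaves


def extract_features(obs: Tuple[float, float]) -> Features:
    """Hypothetical stand-in for the distilled feature extractor.

    Assumes the observation is already (ball_x, paddle_x); in the setting the
    abstract describes, a lightweight model would predict such values from pixels.
    """
    ball_x, paddle_x = obs
    return {"ball_minus_paddle_x": ball_x - paddle_x}


def act(tree: Node, feats: Features) -> Optional[str]:
    """Walk the tree from the root to a leaf and return that leaf's action."""
    node = tree
    while node.feature is not None:
        node = node.left if feats[node.feature] <= node.threshold else node.right
    return node.action


def explain(tree: Node, feats: Features) -> List[str]:
    """Return the sequence of feature tests that led to the chosen action."""
    path: List[str] = []
    node = tree
    while node.feature is not None:
        if feats[node.feature] <= node.threshold:
            path.append(f"{node.feature} <= {node.threshold}")
            node = node.left
        else:
            path.append(f"{node.feature} > {node.threshold}")
            node = node.right
    path.append(f"action = {node.action}")
    return path


# Hand-written toy policy (an assumption for illustration, not a learned tree).
POLICY = Node(
    feature="ball_minus_paddle_x",
    threshold=0.0,
    left=Node(action="LEFT"),
    right=Node(action="RIGHT"),
)

if __name__ == "__main__":
    feats = extract_features((0.2, 0.5))
    print(act(POLICY, feats))
    print(explain(POLICY, feats))
```

Because every decision reduces to a short list of threshold tests on named features, the path returned by `explain` can be read and checked directly, which is the kind of verifiability the abstract attributes to tree-based policies.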
