LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning

Abstract

Recent advances in reinforcement learning (RL) have predominantly leveraged neural network policies for decision-making, yet these models often lack interpretability, posing challenges for stakeholder comprehension and trust. Concept bottleneck models offer an interpretable alternative by integrating human-understandable concepts into policies. However, prior work assumes that concept annotations are readily available during training. For RL, this requirement poses a significant limitation: it necessitates continuous real-time concept annotation, which either places an impractical burden on human annotators or incurs substantial costs in API queries and inference time when employing automated labeling methods. To overcome this limitation, we introduce a novel training scheme that enables RL agents to efficiently learn a concept-based policy by only querying annotators to label a small set of data. Our algorithm, LICORICE, involves three main contributions: interleaving concept learning and RL training, using an ensemble to actively select informative data points for labeling, and decorrelating the concept data. We show how LICORICE reduces human labeling efforts to 500 or fewer concept labels in three environments, and 5000 or fewer in two more complex environments, all at no cost to performance. We also explore the use of VLMs as automated concept annotators, finding them effective in some cases but imperfect in others. Our work significantly reduces the annotation burden for interpretable RL, making it more practical for real-world applications that necessitate transparency.
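
To make the ensemble-based selection step concrete, the following is a minimal sketch of one common way to pick informative points with an ensemble: query-by-committee with vote entropy. This is an illustrative assumption, not necessarily the acquisition rule used by LICORICE, which the abstract does not specify. The function names `vote_entropy` and `select_for_labeling`, the `budget` parameter, and the single categorical concept are all hypothetical.

```python
import numpy as np


def vote_entropy(votes: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-sample disagreement of an ensemble (assumed criterion, not from the paper).

    votes: integer array of shape (n_members, n_samples), where votes[m, i]
    is the concept value predicted by ensemble member m for sample i.
    """
    n_members, n_samples = votes.shape
    counts = np.zeros((n_samples, num_classes))
    for member_votes in votes:
        # Add one vote per sample for the class this member predicted.
        counts[np.arange(n_samples), member_votes] += 1
    probs = counts / n_members
    # Treat 0 * log(0) as 0 so unanimous votes give zero entropy.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(probs > 0, probs * np.log(probs), 0.0)
    return -terms.sum(axis=1)


def select_for_labeling(votes: np.ndarray, num_classes: int, budget: int) -> np.ndarray:
    """Return indices of the `budget` samples the ensemble disagrees on most."""
    scores = vote_entropy(votes, num_classes)
    # Stable sort on negated scores: highest disagreement first.
    order = np.argsort(-scores, kind="stable")
    return order[:budget]


# Hypothetical predictions from 3 ensemble members on 4 unlabeled samples.
votes = np.array([
    [0, 1, 2, 0],
    [0, 1, 0, 1],
    [0, 2, 1, 1],
])
selected = select_for_labeling(votes, num_classes=3, budget=2)
```

Under this sketch, only the selected indices would be sent to the annotator, so the labeling cost is bounded by `budget` rather than by the number of states the agent visits.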
