Abstract
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. Compared to prior work that uses atomic labels as explanations, analyzing neurons compositionally allows us to more precisely and expressively characterize their behavior. We use this procedure to answer several questions on interpretability in models for vision and natural language processing. First, we examine the kinds of abstractions learned by neurons. In image classification, we find that many neurons learn highly abstract but semantically coherent visual concepts, while other polysemantic neurons detect multiple unrelated features; in natural language inference (NLI), neurons learn shallow lexical heuristics from dataset biases. Second, we see whether compositional explanations give us insight into model performance: vision neurons that detect human-interpretable concepts are positively correlated with task performance, while NLI neurons that fire for shallow heuristics are negatively correlated with task performance. Finally, we show how compositional explanations provide an accessible way for end users to produce simple "copy-paste" adversarial examples that change model behavior in predictable ways.