Abstract
Interpreting individual neurons or directions in activation space is an important component of mechanistic interpretability. As such, many algorithms have been proposed to automatically produce neuron explanations, but it is often not clear how reliable these explanations are, or which methods produce the best explanations. This can be measured via crowdsourced evaluations, but they can often be noisy and expensive, leading to unreliable results. In this paper, we carefully analyze the evaluation pipeline and develop a cost-effective and highly accurate crowdsourced evaluation strategy. In contrast to previous human studies that only rate whether the explanation matches the most highly activating inputs, we estimate whether the explanation describes neuron activations across all inputs. To estimate this effectively, we introduce a novel application of importance sampling to determine which inputs are the most valuable to show to raters, leading to around a 30x cost reduction compared to uniform sampling. We also analyze the label noise present in crowdsourced evaluations and propose a Bayesian method to aggregate multiple ratings, leading to a further ~5x reduction in the number of ratings required for the same accuracy. Finally, we use these methods to conduct a large-scale study comparing the quality of neuron explanations produced by the most popular methods for two different vision models.
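As a rough illustration of why importance sampling can reduce the number of inputs that need to be rated, the following Python sketch estimates an average over all inputs by drawing inputs from a non-uniform proposal distribution and reweighting each draw. It is a generic textbook estimator, not the estimator used in this paper; the function name, the toy data, and the choice of proposal weights are all assumptions made for illustration.

```python
import random


def importance_sampling_estimate(values, proposal_weights, n_samples, seed=0):
    """Estimate mean(values) from n_samples draws (hypothetical helper).

    Indices are drawn with probability proportional to proposal_weights,
    and each drawn value is reweighted by p(i) / q(i), where p is the
    uniform distribution over inputs and q is the proposal. The estimate
    is unbiased as long as q(i) > 0 wherever values[i] != 0.
    """
    rng = random.Random(seed)
    n = len(values)
    total = sum(proposal_weights)
    q = [w / total for w in proposal_weights]
    drawn = rng.choices(range(n), weights=q, k=n_samples)
    # p(i) = 1 / n, so the importance weight is 1 / (n * q[i]).
    return sum(values[i] / (n * q[i]) for i in drawn) / n_samples


if __name__ == "__main__":
    # Toy data (assumed): 1% of inputs carry almost all of the mass,
    # loosely mimicking a neuron that activates strongly on few inputs.
    data_rng = random.Random(1)
    values = [
        data_rng.uniform(5.0, 10.0) if data_rng.random() < 0.01
        else data_rng.uniform(0.0, 0.01)
        for _ in range(10_000)
    ]
    true_mean = sum(values) / len(values)

    uniform = importance_sampling_estimate(values, [1.0] * len(values), 200)
    # Proposal roughly proportional to the value, plus a small floor so
    # every input keeps a nonzero probability of being drawn.
    weighted = importance_sampling_estimate(
        values, [v + 0.01 for v in values], 200
    )
    print(f"true mean:            {true_mean:.4f}")
    print(f"uniform sampling:     {uniform:.4f}")
    print(f"importance sampling:  {weighted:.4f}")
```

With the same budget of 200 draws, the uniform estimate depends heavily on how many of the rare high-value inputs happen to be drawn, whereas a proposal that concentrates on those inputs typically yields a much lower-variance estimate. In a rating setting the quantity being averaged would depend on human labels that are unknown in advance, so the proposal would have to be built from something available beforehand, such as the activations themselves.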