An Interpretability Evaluation Benchmark for Pre-trained Language Models

Abstract

While pre-trained language models (LMs) have brought great improvements in many NLP tasks, there is growing interest in exploring the capabilities of LMs and interpreting their predictions. However, existing work usually focuses on only a certain capability, using some downstream tasks. There is a lack of datasets for directly evaluating the masked word prediction performance and the interpretability of pre-trained LMs. To fill this gap, we propose a novel evaluation benchmark that provides annotated data in both English and Chinese. It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning, and computation. In addition, it provides carefully annotated token-level rationales that satisfy sufficiency and compactness. It also contains perturbed instances for each original instance, so that rationale consistency under perturbations can be used as the metric for faithfulness, one perspective of interpretability. We conduct experiments on several widely used pre-trained LMs. The results show that they perform very poorly on the dimensions of knowledge and computation. Their plausibility in all dimensions is far from satisfactory, especially when the rationale is short. In addition, the pre-trained LMs we evaluated are not robust on syntax-aware data. We will release this evaluation benchmark at \url{http://xyz}, and hope it can facilitate the research progress of pre-trained LMs.
