VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

Abstract

Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility). Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets in terms of both in-distribution and out-of-distribution accuracy. While past work suggests that the mechanism for improved accuracy is through improved explanation plausibility, we show that this relationship depends crucially on explanation faithfulness (whether explanations truly represent the model's internal reasoning). Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful. Lastly, we show that, surprisingly, RRR metrics are not predictive of out-of-distribution model accuracy when controlling for a model's in-distribution accuracy, which calls into question the value of these metrics for evaluating model reasoning. All supporting code is available at https://github.com/zfying/visfis
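
To make the four objectives named in the abstract concrete, the following is a minimal illustrative sketch in NumPy. It is not the authors' implementation (the linked repository is the reference for that). Everything here is an assumption chosen for illustration: the toy mean-pooling model `predict`, the leave-one-out importance estimate `loo_importance`, masking by zeroing region features, Gaussian noise as the perturbation, KL divergence and cosine distance as the loss forms, and all function and variable names.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(regions, W):
    """Hypothetical toy model: mean-pool region features, then a linear layer."""
    return softmax(regions.mean(axis=0) @ W)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def loo_importance(regions, W, label):
    """Assumed FI estimate: drop in label probability when one region is zeroed."""
    base = predict(regions, W)[label]
    scores = np.empty(len(regions))
    for i in range(len(regions)):
        ablated = regions.copy()
        ablated[i] = 0.0
        scores[i] = base - predict(ablated, W)[label]
    return scores

def four_objectives(regions, W, label, human_mask, rng):
    """Return one illustrative loss term per objective (lower is better)."""
    keep = human_mask.astype(bool)[:, None]   # True for human-important regions
    n_classes = W.shape[1]
    uniform = np.full(n_classes, 1.0 / n_classes)

    # (1) Sufficiency: be accurate when only the important regions are kept.
    p_suff = predict(np.where(keep, regions, 0.0), W)
    sufficiency = float(-np.log(p_suff[label] + 1e-12))

    # (2) Uncertainty: be near max-entropy when no important region is kept.
    p_unc = predict(np.where(keep, 0.0, regions), W)
    uncertainty = kl(p_unc, uniform)

    # (3) Invariance: predictions should not move when unimportant regions change.
    noise = rng.normal(scale=0.1, size=regions.shape)
    p_orig = predict(regions, W)
    p_pert = predict(np.where(keep, regions, regions + noise), W)
    invariance = kl(p_orig, p_pert)

    # (4) Plausibility: model FI should align with human FI (cosine distance).
    fi = loo_importance(regions, W, label)
    denom = np.linalg.norm(fi) * np.linalg.norm(human_mask) + 1e-12
    plausibility = float(1.0 - (fi @ human_mask) / denom)

    return {
        "sufficiency": sufficiency,
        "uncertainty": uncertainty,
        "invariance": invariance,
        "plausibility": plausibility,
    }

# Toy usage: 5 image regions with 8-dim features, 3 answer classes.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 3))
human_mask = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # hypothetical human FI annotation
losses = four_objectives(regions, W, label=1, human_mask=human_mask, rng=rng)
print(losses)
```

In a training setting these terms would typically be weighted and added to the ordinary task loss; the weights, the explanation method, and the way unimportant features are replaced are design choices that this sketch does not attempt to settle.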
