Unlearning-based Neural Interpretations

Abstract

Gradient-based interpretations often require an anchor point of comparison to avoid saturation in computing feature importance. We show that current baselines defined using static functions (constant mapping, averaging or blurring) inject harmful colour, texture or frequency assumptions that deviate from model behaviour. This leads to accumulation of irregular gradients, resulting in attribution maps that are biased, fragile and manipulable. Departing from the static approach, we propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent. Our method discovers reliable baselines and succeeds in erasing salient features, which in turn locally smooths the high-curvature decision boundaries. Our analyses point to unlearning as a promising avenue for generating faithful, efficient and robust interpretations.
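To make the idea of an adaptive baseline concrete, the following is a minimal toy sketch, not the UNI algorithm itself, whose details the abstract does not specify. It assumes a hypothetical logistic model with weights `w` and bias `b`, builds a baseline by normalised gradient ascent on the loss with respect to the input, and then computes path-integrated attributions between that baseline and the input. All function names, the step size and the number of steps are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Hypothetical toy model: logistic regression on a feature vector.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad_wrt_input(w, b, x, y):
    # Gradient of binary cross-entropy w.r.t. the input: (p - y) * w.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def unlearning_baseline(w, b, x, y, step=0.5, n_steps=20):
    # Assumption: move the input along the normalised loss gradient
    # (steepest ascent), so the model "forgets" the evidence for y.
    base = list(x)
    for _ in range(n_steps):
        g = loss_grad_wrt_input(w, b, base, y)
        norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
        base = [bi + step * gi / norm for bi, gi in zip(base, g)]
    return base

def path_attributions(w, b, x, baseline, n_points=200):
    # Integrated-gradients-style attribution along the straight line
    # from baseline to x, using a midpoint Riemann sum.
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    totals = [0.0] * len(x)
    for k in range(n_points):
        alpha = (k + 0.5) / n_points
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        p = predict(w, b, point)
        for i, wi in enumerate(w):
            totals[i] += p * (1.0 - p) * wi  # d(predict)/d(x_i)
    return [d * t / n_points for d, t in zip(diffs, totals)]

w, b = [2.0, -1.0, 0.0], 0.1
x, y = [1.0, -0.5, 3.0], 1
baseline = unlearning_baseline(w, b, x, y)
attributions = path_attributions(w, b, x, baseline)
```

In this toy setting the baseline depends on the model and the input rather than on a fixed colour or blur, the feature with zero weight is left unchanged and receives zero attribution, and the attributions sum approximately to the drop in model confidence between the input and the baseline.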
