Inherently Faithful Attention Maps for Vision Transformers

Abstract

We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
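
To make the idea of input attention masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes single-head scaled dot-product attention over a handful of toy tokens, and the binary mask `keep` is specified by hand here, whereas in the paper it is learned by stage 1. The function name `masked_attention` and all values are hypothetical. The point it illustrates is that tokens outside the mask receive zero attention weight, so they cannot influence the output.

```python
import math

def masked_attention(queries, keys, values, keep):
    """Scaled dot-product attention that ignores keys where keep[j] == 0.

    Hypothetical helper for illustration; `keep` is a hand-written binary
    mask standing in for a learned one.
    """
    if not any(keep):
        raise ValueError("at least one token must be kept")
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Masked tokens get a score of -inf, which becomes weight 0 after softmax.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) if m else float("-inf")
            for k, m in zip(keys, keep)
        ]
        top = max(scores)  # finite, because at least one token is kept
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of value vectors, one output dimension at a time.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

# Three toy tokens; the third plays the role of a background patch.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keep = [1, 1, 0]
baseline = masked_attention(tokens, tokens, tokens, keep)

# Replacing the masked-out token leaves the attention output unchanged.
swapped = tokens[:2] + [[100.0, -100.0]]
perturbed = masked_attention(tokens, swapped, swapped, keep)
```

In this sketch, `baseline` and `perturbed` agree because the third token never contributes to any weighted sum, which is the property the abstract describes as restricting the receptive field to the attended regions.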
