Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

Abstract

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
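
The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea. It recombines an object embedding with alternative contexts and then subtracts a background-only score. Everything in it is an assumption made for illustration: the function name `counterfactual_scores`, the additive recombination, the averaging over contexts, the weight `lam`, and the toy 3-dimensional vectors standing in for CLIP embeddings. It is not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def counterfactual_scores(obj_feat, bg_feat, alt_bg_feats, text_feats, lam=1.0):
    """Hypothetical debiased class scores (illustrative, not the paper's method).

    obj_feat:     (d,)   estimated object embedding
    bg_feat:      (d,)   estimated background embedding of the test image
    alt_bg_feats: (K, d) alternative context embeddings
    text_feats:   (C, d) class text embeddings
    lam:          assumed weight on the background-only correction
    """
    text = l2_normalize(text_feats)
    # Assumption: recombine by adding the object to each alternative context.
    cf = l2_normalize(obj_feat[None, :] + alt_bg_feats)   # (K, d)
    cf_scores = (cf @ text.T).mean(axis=0)                # (C,) averaged over contexts
    # Score the classes would receive from the background alone.
    bg_scores = l2_normalize(bg_feat) @ text.T            # (C,)
    return cf_scores - lam * bg_scores

# Toy example: the object matches class 0, the background is tied to class 1.
text_feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
obj_feat = np.array([1.0, 0.0, 0.0])
bg_feat = np.array([0.0, 2.0, 0.0])
alt_bg_feats = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])

# Naive zero-shot scores on the entangled object + background embedding.
naive_scores = l2_normalize(obj_feat + bg_feat) @ l2_normalize(text_feats).T
debiased_scores = counterfactual_scores(obj_feat, bg_feat, alt_bg_feats, text_feats)
```

In this toy setup the naive scores favor class 1 because the background dominates the combined embedding, while the corrected scores favor class 0. Real CLIP features would not separate this cleanly, and how the object and background embeddings are estimated is left to the full paper.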
