Abstract
Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.