OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning

Abstract

Transductive zero-shot learning (ZSL) aims to classify unseen categories by leveraging both semantic class descriptions and the distribution of unlabeled test data. While Vision-Language Models (VLMs) such as CLIP excel at aligning visual inputs with textual semantics, they often rely too heavily on class-level priors and fail to capture fine-grained visual cues. In contrast, Vision-only Foundation Models (VFMs) like DINOv2 provide rich perceptual features but lack semantic alignment. To exploit the complementary strengths of these models, we propose OTFusion, a simple yet effective training-free framework that bridges VLMs and VFMs via Optimal Transport. Specifically, OTFusion aims to learn a shared probabilistic representation that aligns visual and semantic information by minimizing the transport cost between their respective distributions. This unified distribution enables coherent class predictions that are both semantically meaningful and visually grounded. Extensive experiments on 11 benchmark datasets demonstrate that OTFusion consistently outperforms the original CLIP model, achieving an average accuracy improvement of nearly $10\%$, all without any fine-tuning or additional annotations. The code will be publicly released after the paper is accepted.
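
To make the idea of "minimizing the transport cost" between test samples and classes more concrete, the following is a minimal sketch of entropic optimal transport solved with Sinkhorn iterations. It is not the OTFusion algorithm, which the abstract does not specify in detail. The function name `sinkhorn_assign`, the uniform class prior, the regularization strength `epsilon`, and the random stand-in for VLM zero-shot probabilities are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_assign(cost, epsilon=1.0, n_iters=500):
    """Entropic OT plan between N samples and K classes.

    Assumes uniform mass on samples and on classes (a hypothetical
    balanced-class prior, not something stated in the abstract).
    """
    n, k = cost.shape
    a = np.full(n, 1.0 / n)          # sample marginal
    b = np.full(k, 1.0 / k)          # class marginal (assumed uniform)
    kernel = np.exp(-cost / epsilon)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)         # rescale rows to match a
        v = b / (kernel.T @ u)       # rescale columns to match b
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)

# Hypothetical stand-in for zero-shot class probabilities from a VLM:
# 12 unlabeled test samples, 3 unseen classes.
logits = rng.normal(size=(12, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cost of assigning a sample to a class: negative log-probability.
cost = -np.log(probs)

plan = sinkhorn_assign(cost)

# Row-normalize the plan to read it as per-sample soft labels.
soft_labels = plan / plan.sum(axis=1, keepdims=True)
pred = soft_labels.argmax(axis=1)
```

The transport plan differs from a per-sample softmax in that its column sums are constrained, so the predictions account for the distribution of the whole unlabeled test set, which is the transductive ingredient described above. A fusion method along the lines of the abstract would additionally bring vision-only features into the cost or the marginals; how OTFusion does this is not stated here.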
