Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

Abstract

Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. Code is available at https://github.com/zhangce01/DPE-CLIP.
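
To make the idea of evolving two prototype sets concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that each class prototype is a unit-length feature vector updated by a re-normalized running average, and that a prediction sums cosine similarities against the textual and visual prototypes. The names `EvolvingPrototypes`, `predict`, and the weight `alpha` are hypothetical, and the per-sample learnable residuals mentioned in the abstract are omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class EvolvingPrototypes:
    """One prototype per class, refined as test samples arrive.

    Assumption: the update is a running average followed by
    re-normalization; the paper's actual rule may differ.
    """

    def __init__(self, initial):
        self.protos = [normalize(p) for p in initial]
        self.counts = [1] * len(self.protos)

    def update(self, cls, feature):
        n = self.counts[cls]
        f = normalize(feature)
        # Weighted average of the current prototype and the new feature.
        mean = [(n * p + x) / (n + 1) for p, x in zip(self.protos[cls], f)]
        self.protos[cls] = normalize(mean)
        self.counts[cls] = n + 1


def predict(feature, text_protos, visual_protos, alpha=1.0):
    """Return the class whose combined text + visual similarity is highest.

    `alpha` (hypothetical) weights the visual term against the textual one.
    """
    f = normalize(feature)
    scores = [
        dot(t, f) + alpha * dot(v, f)
        for t, v in zip(text_protos.protos, visual_protos.protos)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In a test-time loop under these assumptions, one would call `predict` on each incoming image feature and then call `update` on the predicted class, so that both prototype sets accumulate task-specific information as more samples are processed.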
