LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

Abstract

Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.
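
The abstract does not spell out how per-class prototypes are learned, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes (hypothetically) that a pseudo-label is the class whose prototype has the highest cosine similarity to an image embedding, and that the selected prototype is then nudged toward the embedding of a generated description with a momentum update. The function names, the momentum value, and the toy 2-D vectors standing in for CLIP features are all assumptions; the "dual pseudo-labels" mentioned above are not modeled here.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def pseudo_label(image_emb, prototypes):
    """Return the index of the prototype most similar to the image embedding."""
    sims = [cosine(image_emb, p) for p in prototypes]
    return max(range(len(sims)), key=sims.__getitem__)


def update_prototype(prototype, text_emb, momentum=0.9):
    """Momentum (EMA) update of a prototype toward a text embedding.

    The momentum value is an assumption chosen for illustration.
    """
    mixed = [momentum * p + (1 - momentum) * t for p, t in zip(prototype, text_emb)]
    return normalize(mixed)


# Toy 2-D embeddings standing in for CLIP features (hypothetical values).
prototypes = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]  # one per class name
image_emb = [0.9, 0.2]  # embedding of an unlabeled image
desc_emb = [0.8, 0.6]   # embedding of an LMM-generated description of that image

label = pseudo_label(image_emb, prototypes)
prototypes[label] = update_prototype(prototypes[label], desc_emb)
```

In this toy setup the image is assigned to the first class, and that class prototype moves slightly toward the description embedding while staying unit-norm. A high momentum limits how far a single noisy or hallucinated description can shift a prototype, which is one plausible way to read "stabilize the training".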
