HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Abstract

Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10$\times$ more training data than ours. All code, data, and models are available at https://zxwei.site/hqclip.
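To make the idea of adding negative descriptions to contrastive learning concrete, the following is a minimal illustrative sketch, not the loss used by HQ-CLIP, whose exact formulation is not given in the abstract. It assumes a standard image-to-text InfoNCE objective in which each image's own negative description is appended as one extra candidate in the softmax. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical. The text-to-image direction and the short-tag supervision are omitted.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_softmax_at(logits, index):
    """Numerically stable log-softmax evaluated at one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def image_to_text_loss(image_embs, pos_text_embs, neg_text_embs=None,
                       temperature=0.07):
    """Image-to-text InfoNCE loss over a batch.

    If neg_text_embs is given (assumption: one negative description per
    image), each image's negative is appended as an extra candidate that
    the image must score below its matching positive text.
    """
    images = [normalize(v) for v in image_embs]
    positives = [normalize(v) for v in pos_text_embs]
    negatives = [normalize(v) for v in neg_text_embs] if neg_text_embs else None

    total = 0.0
    for i, img in enumerate(images):
        # In-batch candidates: every positive text in the batch.
        logits = [dot(img, t) / temperature for t in positives]
        if negatives is not None:
            # Extra candidate: this image's own negative description.
            logits.append(dot(img, negatives[i]) / temperature)
        # The correct candidate is the positive text at index i.
        total -= log_softmax_at(logits, i)
    return total / len(images)


# Toy 2-D embeddings (hypothetical values, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[0.8, 0.2], [0.2, 0.8]]

baseline = image_to_text_loss(images, positives)
with_negatives = image_to_text_loss(images, positives, negatives)
print(baseline, with_negatives)
```

In this sketch the extra candidate can only enlarge the softmax denominator, so the loss with negatives is never lower than the baseline on the same batch. That added penalty is what pushes an image embedding away from a plausible but incorrect description.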
