UniCLIP: Unified Framework for Contrastive Language-Image Pre-training

Abstract

Pre-training vision-language models with contrastive objectives has shown promising results that are both scalable to large uncurated datasets and transferable to many downstream applications. Subsequent works have aimed to improve data efficiency by adding self-supervision terms, but in those works the inter-domain (image-text) contrastive loss and the intra-domain (image-image) contrastive loss are defined on separate spaces, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training. UniCLIP integrates the contrastive losses of both inter-domain pairs and intra-domain pairs into a single universal space. The discrepancies that arise when integrating contrastive losses between different domains are resolved by the three key components of UniCLIP: (1) augmentation-aware feature embedding, (2) MP-NCE loss, and (3) domain-dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks. In our experiments, we show that each component of UniCLIP contributes to the final performance.
