Abstract
In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, a one-point representation has difficulty capturing the relationships and the similarity structure among the huge number of instances in the real world. To represent richer classes of similarity, we propose the use of weighted point sets, namely, sets of pairs of a weight and a vector, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound on the excess risk on downstream classification tasks for representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point sets consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.
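
As a rough illustration of the two ingredients named above, the following minimal sketch computes a symmetric InfoNCE loss from a matrix of pairwise similarities, where each similarity is taken between two weighted point sets. The function names, the Gaussian kernel, the `bandwidth` parameter, and the averaging of the two directions are assumptions made for illustration only; they are not the definitions used in this work.

```python
import math


def row_info_nce(sim):
    """Mean over rows i of -log softmax(sim[i])[i] for a square matrix."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def symmetric_info_nce(sim):
    """Average of the two directional InfoNCE terms (assumed scaling)."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (row_info_nce(sim) + row_info_nce(transposed))


def point_set_similarity(set_x, set_y, bandwidth=1.0):
    """Hypothetical similarity between weighted point sets.

    Each set is a list of (weight, vector) pairs. The value is
    sum_ij w_i * u_j * k(v_i, z_j) with a Gaussian kernel k, which is an
    illustrative choice, not the similarity proposed in this work.
    """
    total = 0.0
    for w, v in set_x:
        for u, z in set_y:
            sq_dist = sum((a - b) ** 2 for a, b in zip(v, z))
            total += w * u * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total


# Toy batch: two "images" and two "texts", each a weighted point set in 2-D.
images = [
    [(0.5, (0.0, 0.0)), (0.5, (1.0, 0.0))],
    [(1.0, (3.0, 3.0))],
]
texts = [
    [(1.0, (0.5, 0.0))],
    [(0.7, (3.0, 3.0)), (0.3, (2.5, 3.0))],
]

# sim[i][j] compares image i with text j; matched pairs lie on the diagonal.
sim = [[point_set_similarity(x, y) for y in texts] for x in images]
loss = symmetric_info_nce(sim)
```

With a single pair of weight 1 in each set, `point_set_similarity` reduces to a kernel between two points, which corresponds to the one-point representation discussed above.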