Abstract
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs against sets of negative samples. Recently, the principle has also been used to learn cross-modal embeddings for video and text, yet without exploiting its full potential. In particular, previous losses do not take the intra-modality similarities into account, which leads to inefficient embeddings, as the same content is mapped to multiple points in the embedding space. With CrossCLR, we present a contrastive loss that fixes this issue. Moreover, we define sets of highly related samples in terms of their input embeddings and exclude them from the negative samples to avoid issues with false negatives. We show that these principles consistently improve the quality of the learned embeddings. The joint embeddings learned with CrossCLR extend the state of the art in video-text retrieval on the Youcook2 and LSMDC datasets and in video captioning on the Youcook2 dataset by a large margin. We also demonstrate the generality of the concept by learning improved joint embeddings for other pairs of modalities.
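
The two ideas named in the abstract are adding intra-modality negatives and excluding highly related samples from the negative set. A minimal NumPy sketch can illustrate both. It is a simplified illustration under stated assumptions and not the exact CrossCLR formulation: the function name, the temperature `tau`, the cosine-similarity threshold `related_threshold`, and the rule for deciding which samples count as "highly related" are all hypothetical choices made for this example.

```python
import numpy as np


def _l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cross_modal_contrastive_loss(video_emb, text_emb, video_in, text_in,
                                 tau=0.1, related_threshold=0.9):
    """Illustrative cross-modal contrastive loss (hypothetical, simplified).

    video_emb, text_emb: learned joint embeddings, shape (n, d); row i of
        each forms a positive pair.
    video_in, text_in: input (pre-projection) embeddings, shape (n, d_in),
        used only to decide which samples are highly related.
    tau, related_threshold: assumed hyperparameters, not taken from the paper.
    """
    v = _l2_normalize(video_emb)
    t = _l2_normalize(text_emb)
    eye = np.eye(v.shape[0], dtype=bool)

    def one_direction(anchor, other, anchor_in):
        # Assumption: samples whose input embeddings have cosine similarity
        # above the threshold are treated as highly related.
        a_in = _l2_normalize(anchor_in)
        related = (a_in @ a_in.T) > related_threshold
        # Negatives: every other sample that is not highly related.
        neg_mask = ~eye & ~related

        inter = np.exp(anchor @ other.T / tau)   # cross-modal similarities
        intra = np.exp(anchor @ anchor.T / tau)  # intra-modality similarities
        pos = np.diag(inter)
        neg = (inter * neg_mask).sum(axis=1) + (intra * neg_mask).sum(axis=1)
        return -np.log(pos / (pos + neg)).mean()

    # Symmetric loss: video-to-text plus text-to-video.
    return 0.5 * (one_direction(v, t, video_in) + one_direction(t, v, text_in))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    print("aligned pairs:   ", cross_modal_contrastive_loss(x, x, x, x))
    print("mismatched pairs:", cross_modal_contrastive_loss(x, x[::-1], x, x[::-1]))
```

In this sketch the intra-modality term penalizes different samples of the same modality for lying close together in the joint space, while the mask removes likely false negatives from both sums. A standard cross-modal contrastive loss would correspond to dropping the `intra` term and setting `neg_mask` to `~eye`.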