Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval

Abstract

Two loss functions are popular for vision-language retrieval, i.e., triplet loss and contrastive learning loss, both of which essentially minimize the difference between the similarities of negative pairs and positive pairs. More specifically, Triplet loss with Hard Negative mining (Triplet-HN), which is widely used in existing retrieval models to improve discriminative ability, tends to fall into local minima during training. On the other hand, Vision-Language Contrastive learning loss (VLC), which is widely used in vision-language pre-training, has been shown to achieve significant performance gains on vision-language retrieval, but the performance of fine-tuning with VLC on small datasets is not satisfactory. This paper proposes a unified loss of pair similarity optimization for vision-language retrieval, providing a powerful tool for understanding existing loss functions. Our unified loss includes the hard sample mining strategy of VLC and introduces the margin used by the triplet loss for better similarity separation. It is shown that both Triplet-HN and VLC are special forms of our unified loss. Compared with Triplet-HN, our unified loss converges faster. Compared with VLC, our unified loss is more discriminative and provides better generalization in downstream fine-tuning tasks. Experiments on image-text and video-text retrieval benchmarks show that our unified loss can significantly improve the performance of state-of-the-art retrieval models.
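
To make the two baseline losses concrete, the following is a minimal NumPy sketch of their commonly used forms, not the implementation from this paper. It assumes a square similarity matrix `sim` whose diagonal holds the positive pairs; the function names, the batch-mean reduction, and the default `margin` and `temperature` values are illustrative assumptions. The unified loss itself is not sketched here, since its exact form is not given in the abstract.

```python
import numpy as np


def triplet_hn_loss(sim, margin=0.2):
    """Triplet loss with in-batch hardest negatives (common formulation).

    sim[i, j] is the similarity of image i and text j; sim[i, i] is the
    positive pair. The margin value and the mean reduction are assumptions.
    """
    n = sim.shape[0]
    pos = np.diag(sim)
    # Exclude positives when searching for the hardest negative.
    neg = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    hardest_text = neg.max(axis=1)   # hardest negative text per image
    hardest_image = neg.max(axis=0)  # hardest negative image per text
    loss = (np.maximum(0.0, margin + hardest_text - pos)
            + np.maximum(0.0, margin + hardest_image - pos))
    return loss.mean()


def vlc_loss(sim, temperature=0.05):
    """Symmetric InfoNCE-style contrastive loss (common formulation).

    The temperature value is an illustrative assumption.
    """
    logits = sim / temperature

    def neg_log_prob_of_positive(z):
        # Row-wise log-softmax, stabilized by subtracting the row maximum.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.diag(log_prob).mean()

    image_to_text = neg_log_prob_of_positive(logits)
    text_to_image = neg_log_prob_of_positive(logits.T)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    # Hypothetical 2x2 similarity matrix in which one negative outscores a positive.
    sim = np.array([[0.5, 0.9],
                    [0.1, 0.6]])
    print("Triplet-HN:", triplet_hn_loss(sim))
    print("VLC:", vlc_loss(sim))
```

In this sketch, Triplet-HN draws its gradient only from the single hardest negative per query and applies an explicit margin, whereas VLC weights all in-batch negatives through the softmax and has no margin. These are the two ingredients the abstract says the unified loss combines.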
