Dual-Path Convolutional Image-Text Embedding with Instance Loss

Abstract

Matching images and sentences demands a fine understanding of both modalities. In this paper, we propose a new system to discriminatively embed the image and text into a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image / text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss is hard for network learning, since it starts from two heterogeneous features to build the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image / text group can be viewed as a class, so the network can learn the fine granularity from every image / text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. So, as a minor contribution, this paper constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
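
The two losses named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine similarity, the margin value of 0.2, and the choice to simply sum the image and text classification terms are all assumptions made for illustration. The only idea taken from the abstract is that each image / text group is treated as its own class (instance loss), and that a ranking loss pulls positive pairs close while pushing negative pairs apart.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, given raw class scores and the true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def instance_loss(image_logits, text_logits, group_ids):
    """Classification loss in which every image/text group is its own class.

    image_logits[i] and text_logits[i] are the class scores of the i-th pair;
    group_ids[i] is the index of the group (class) that pair belongs to.
    Summing the two modality terms is an assumption of this sketch.
    """
    total = 0.0
    for img, txt, gid in zip(image_logits, text_logits, group_ids):
        total += softmax_cross_entropy(img, gid) + softmax_cross_entropy(txt, gid)
    return total / len(group_ids)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style ranking loss: the positive should beat the negative by `margin`.

    The margin of 0.2 is a hypothetical default, not a value from the paper.
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

In this sketch, the instance loss depends only on how well each modality classifies its own group, which is why it can shape the intra-modal distribution before the ranking loss is applied to cross-modal pairs.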
