Abstract
Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are limited by training on small-scale datasets that cover only a limited number of images with ground-truth sentences. Moreover, creating a larger dataset by annotating millions of images with sentences is extremely expensive and may lead to a biased model. Inspired by the recent success of webly supervised learning in deep neural networks, we capitalize on readily available web images with noisy annotations to learn a robust image-text joint representation. Specifically, our main idea is to leverage web images and their corresponding tags, along with fully annotated datasets, in training to learn the visual-semantic joint embedding. We propose a two-stage approach that augments a typical supervised pairwise ranking loss formulation with weakly annotated web images to learn a more robust visual-semantic embedding. Experiments on two standard benchmark datasets demonstrate that our method achieves a significant performance gain in image-text retrieval compared to state-of-the-art approaches.