Finding beans in burgers: Deep semantic-visual embedding with localization

Abstract

Several works have proposed to learn a two-path neural network that mapsimages and texts, respectively, to a same shared Euclidean space where geometrycaptures useful semantic relationships. Such a multi-modal embedding can betrained and used for various tasks, notably image captioning. In the presentwork, we introduce a new architecture of this type, with a visual path thatleverages recent space-aware pooling mechanisms. Combined with a textual pathwhich is jointly trained from scratch, our semantic-visual embedding offers aversatile model. Once trained under the supervision of captioned images, ityields new state-of-the-art performance on cross-modal retrieval. It alsoallows the localization of new concepts from the embedding space into any inputimage, delivering state-of-the-art result on the visual grounding of phrases.

Quick Read (beta)

loading the full paper ...