Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association

Abstract

Person re-identification is an important task that requires learning discriminative visual features for distinguishing different person identities. Diverse auxiliary information has been utilized to improve visual feature learning. In this paper, we propose to exploit natural language descriptions as additional training supervision for learning effective visual features. Compared with other auxiliary information, language can describe a specific person from more compact and semantic visual aspects, and is thus complementary to the pixel-level image data. Our method not only learns better global visual features with the supervision of the overall description but also enforces semantic consistency between local visual and linguistic features, which is achieved by building global and local image-language associations. The global image-language association is established according to the identity labels, while the local association is based upon the implicit correspondences between image regions and noun phrases. Extensive experiments demonstrate the effectiveness of employing language as training supervision with the two association schemes. Our method achieves state-of-the-art performance without utilizing any auxiliary information during testing and shows better performance than other joint embedding methods for image-language association.
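
As a rough illustration of the identity-based global association, the sketch below marks an (image, description) pair as positive when both carry the same identity label and scores the pair with a binary cross-entropy loss. The function names, the dot-product scoring, and the toy features are assumptions made for illustration only; the abstract does not specify the scoring function or network, and the local region-to-noun-phrase association is not covered here.

```python
import numpy as np

def pair_labels(img_ids, txt_ids):
    """Return a 0/1 matrix: entry (i, j) is 1 when image i and
    description j share the same identity label."""
    img_ids = np.asarray(img_ids)
    txt_ids = np.asarray(txt_ids)
    return (img_ids[:, None] == txt_ids[None, :]).astype(np.float64)

def global_association_loss(img_feats, txt_feats, img_ids, txt_ids):
    """Mean binary cross-entropy over all image-description pairs.

    Assumption: the pair score is a plain dot product of the two
    feature vectors (a hypothetical stand-in for a learned scorer).
    """
    labels = pair_labels(img_ids, txt_ids)
    logits = np.asarray(img_feats) @ np.asarray(txt_feats).T
    # Numerically stable BCE with logits:
    # max(x, 0) - x * y + log(1 + exp(-|x|))
    loss = (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())

# Toy features: two images and two descriptions in a 2-D embedding space.
img_feats = np.array([[5.0, 0.0], [0.0, 5.0]])
txt_feats = np.array([[1.0, 0.0], [0.0, 1.0]])

# Descriptions paired with the matching identities give a lower loss
# than the same descriptions paired with swapped identities.
loss_matched = global_association_loss(img_feats, txt_feats, [1, 2], [1, 2])
loss_swapped = global_association_loss(img_feats, txt_feats, [1, 2], [2, 1])
```

Because only identity labels decide which pairs are positive, any description of a person can supervise any image of that person, and no language input is needed once the visual features are trained.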
