Fine-grained Visual-textual Representation Learning

Abstract

Fine-grained visual categorization aims to recognize hundreds of subcategories belonging to the same basic-level category, a highly challenging task because the visual distinctions among similar subcategories are subtle and local. Most existing methods learn part detectors to discover discriminative regions for better categorization performance. However, not all parts are beneficial or indispensable for visual categorization, and the choice of the number of part detectors relies heavily on prior knowledge and experimental validation. When we describe the object in an image via text, we mainly focus on its pivotal characteristics and rarely pay attention to common characteristics or background areas. This is an involuntary transfer from human visual attention to textual attention, so textual attention indicates how many and which parts are discriminative and significant for categorization. Textual attention can therefore help discover visual attention in an image. Inspired by this, we propose a fine-grained visual-textual representation learning (VTRL) approach, whose main contributions are: (1) Fine-grained visual-textual pattern mining is devoted to discovering discriminative visual-textual pairwise information for boosting categorization performance by jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, preserving the intra-modality and inter-modality information to generate a complementary fine-grained representation and further improve categorization performance.
