Vision and Language Integration for Domain Generalization

Abstract

Domain generalization aims to train on source domains to uncover a domain-invariant feature space, allowing the model to generalize robustly to unknown target domains. However, due to domain gaps, it is hard to find a reliable common image feature space, and the reason is the lack of suitable basic units for images. Unlike images in vision space, language has comprehensive expression elements that can effectively convey semantics. Inspired by the semantic completeness of language and the intuitiveness of images, we propose VLCA, which combines language space and vision space and connects multiple image domains by using semantic space as the bridge domain. Specifically, in language space, by taking advantage of the completeness of language basic units, we capture the semantic representation of the relations between categories through word vector distance. Then, in vision space, by taking advantage of the intuitiveness of image features, we explore the common pattern of sample features of the same class through low-rank approximation. Finally, the language representation is aligned with the vision representation through the multimodal space of text and image. Experiments demonstrate the effectiveness of the proposed method.
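
The two building blocks named above, word vector distance between categories and low-rank approximation of same-class features, can be illustrated with a minimal NumPy sketch. This is not the VLCA implementation: the function names, the choice of cosine distance, the use of truncated SVD, and the random placeholder data are all assumptions made for illustration only.

```python
import numpy as np


def pairwise_cosine_distance(word_vectors):
    """Return the (C, C) matrix of cosine distances between class word vectors."""
    normed = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T


def low_rank_approximation(features, rank):
    """Best rank-`rank` approximation in Frobenius norm, via truncated SVD."""
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


rng = np.random.default_rng(0)

# Hypothetical word vectors: 4 class names embedded in 16 dimensions.
class_word_vectors = rng.normal(size=(4, 16))
class_distances = pairwise_cosine_distance(class_word_vectors)

# Hypothetical features of 32 same-class samples (64-dim):
# a shared rank-2 pattern plus small sample-specific noise.
shared = rng.normal(size=(32, 2)) @ rng.normal(size=(2, 64))
same_class_features = shared + 0.01 * rng.normal(size=(32, 64))
common_pattern = low_rank_approximation(same_class_features, rank=2)
```

In this toy setting, `class_distances` plays the role of the inter-category relations in language space, and `common_pattern` the shared structure of same-class features in vision space. How the method aligns the two is not specified in the abstract and is not sketched here.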
