Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Abstract

In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the distributional semantics but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes grounded semantics for cognition. Inspired by this notion and recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowledge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.
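
To make the idea of cross-modal contrastive alignment concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the exact loss, so a symmetric InfoNCE-style objective over matched image-text pairs is assumed here; the function names, the toy two-dimensional embeddings, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy over rows, where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Assumed symmetric contrastive loss: image i and text i form the positive pair,
    and every other pairing in the batch serves as a negative."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # sim[i][j] is the temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))


# Toy batch: two image embeddings and two text embeddings in a shared 2-D space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch the loss is lower when each text embedding lies close to its paired image embedding (`aligned`) than when the pairing is swapped (`shuffled`), which is the property a contrastive objective of this kind rewards during training.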
