Language-Mediated, Object-Centric Representation Learning

Abstract

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised segmentation algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the object segmentation performance of MONet and Slot Attention on two datasets with the help of language. We also show that concepts learned by LORL, in conjunction with segmentation algorithms such as MONet, aid downstream tasks such as referring expression comprehension.
