Language-Mediated, Object-Centric Representation Learning

Abstract

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations to concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised object discovery algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the performance of unsupervised object discovery methods on two datasets via the help of language. We also show that concepts learned by LORL, in conjunction with object discovery methods, aid downstream tasks such as referring expression comprehension.
