Grounding Language Attributes to Objects using Bayesian Eigenobjects

Abstract

We develop a system to disambiguate object instances within the same class based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar the observed object is to the object described by the phrase. Our system is designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth image training set. By decoupling 3D shape representation from language representation, this method is able to ground language to novel objects using a small amount of language-annotated depth data and a larger corpus of unlabeled 3D object meshes, even when these objects are partially observed from unusual viewpoints. Our system is able to disambiguate between novel objects, observed via depth images, based on natural language descriptions. Our method also enables viewpoint transfer; trained on human-annotated data on a small set of depth images captured from frontal viewpoints, our system successfully predicted object attributes from rear views despite having no such depth images in its training set. Finally, we demonstrate our approach on a Baxter robot, enabling it to pick specific objects based on human-provided natural language descriptions.
