Abstract
We develop a system to disambiguate objects based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar the observed object is to the described object. Our system is designed to learn from only a small amount of human-labeled language data and generalize to viewpoints not represented in the language-annotated depth-image training set. By decoupling 3D shape representation from language representation, our method is able to ground language to novel objects using a small amount of language-annotated depth data and a larger corpus of unlabeled 3D object meshes, even when these objects are partially observed from unusual viewpoints. Our system is able to disambiguate between novel objects, observed via depth images, based on natural language descriptions. Our method also enables viewpoint transfer; trained on human-annotated data on a small set of depth images captured from frontal viewpoints, our system successfully predicted object attributes from rear views despite having no such depth images in its training set. Finally, we demonstrate our system on a Baxter robot, enabling it to pick specific objects based on human-provided natural language descriptions.