Embodied Language Grounding with Implicit 3D Visual Feature Representations

Abstract

Consider the utterance "the tomato is to the left of the pot." Humans can answer numerous questions about the situation described, as well as reason through counterfactuals and alternatives, such as "is the pot larger than the tomato?", "can we move to a viewpoint from which the tomato is completely hidden behind the pot?", "can we have an object that is both to the left of the tomato and to the right of the pot?", "would the tomato fit inside the pot?", and so on. Such reasoning capability remains elusive for current computational models of language understanding. To link language processing with spatial reasoning, we propose associating natural language utterances with a mental workspace of their meaning, encoded as 3-dimensional visual feature representations of the world scenes they describe. We learn such 3-dimensional visual representations, which we call visual imaginations, by predicting the images a mobile agent sees while moving around in the 3D world. The input image streams the agent collects are unprojected into egomotion-stable 3D feature maps of the scene, and projected from novel viewpoints to match the observed RGB image views in an end-to-end differentiable manner. We then train modular neural models to generate such 3D feature representations given language utterances, to localize the objects an utterance mentions in the 3D feature representation inferred from an image, and to predict the desired 3D object locations given a manipulation instruction. We empirically show that the proposed models outperform existing 2D models by a large margin in spatial reasoning, referential object detection and instruction following, and generalize better across camera viewpoints and object arrangements.
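
The unprojection and projection steps mentioned in the abstract can be illustrated with plain pinhole-camera geometry. The sketch below is not the paper's model: it assumes a known depth map and intrinsics matrix `K` (neither is specified in the abstract), works on raw 3D points rather than learned feature maps, and the function names `unproject` and `project` are hypothetical. It only shows how pixels can be lifted into 3D and re-rendered from a camera pose.

```python
import numpy as np

def unproject(depth, K):
    """Lift every pixel of a depth map to a 3D point in the camera frame.

    Assumes a pinhole camera with intrinsics K (an assumption of this sketch).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v: pixel rows, u: pixel columns
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def project(points, K, R, t):
    """Project 3D points into a camera with rotation R and translation t."""
    cam = points @ R.T + t           # move points into the target camera frame
    uvw = cam @ K.T                  # apply the intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide gives pixel coordinates

# Illustrative values only: a 48x64 image at a constant depth of 2 units.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)

points = unproject(depth, K)
# Re-projecting with the identity pose should recover the original pixel grid.
pixels = project(points, K, np.eye(3), np.zeros(3)).reshape(48, 64, 2)
```

Passing a different `R` and `t` to `project` renders the same points from a novel viewpoint, which is the geometric idea behind matching predicted views to observed ones.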
