Language Grounding with 3D Objects

Abstract

Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" When viewing mice on the shelf, the number of buttons or presence of a wire may not be visible from certain angles or positions. Flat images of candidate mice may not provide the discriminative information needed for "wireless". The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the necessary exploration to accomplish the task. In particular, while substantial effort and progress have been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE), requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, these models are still weaker at understanding the 3D nature of objects, properties which play a key role in manipulation. In particular, we find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
