Robot Object Retrieval with Contextual Natural Language Queries

Abstract

Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object's type such as "scissors" and/or visual attributes such as "red," thus limiting the robot to only known object classes. We develop a model to retrieve objects based on descriptions of their usage. The model takes in a language command containing a verb, for example "Hand me something to cut," and RGB images of candidate objects and selects the object that best satisfies the task specified by the verb. Our model directly predicts an object's appearance from the object's use specified by a verb phrase. We do not need to explicitly specify an object's class label. Our approach allows us to predict high level concepts like an object's utility based on the language query. Based on contextual information present in the language commands, our model can generalize to unseen object classes and unknown nouns in the commands. Our model correctly selects objects out of sets of five candidates to fulfill natural language commands, and achieves an average accuracy of 62.3% on a held-out test set of unseen ImageNet object classes and 53.0% on unseen object classes and unknown nouns. Our model also achieves an average accuracy of 54.7% on unseen YCB object classes, which have a different image distribution from ImageNet objects. We demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language descriptions of their usage. We also present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes.
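
The selection step described above (pick, out of five candidates, the object that best matches the appearance predicted from the verb phrase) can be illustrated with a minimal sketch. The abstract does not state how candidates are scored, so the cosine-similarity ranking, the three-dimensional toy embeddings, and the names `cosine` and `select_object` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_object(predicted, candidates):
    """Return (index of the best-matching candidate, list of all scores).

    `predicted` stands in for the appearance embedding a model might
    predict from a verb phrase; `candidates` stands in for embeddings of
    the candidate RGB images. Both are hypothetical placeholders.
    """
    scores = [cosine(predicted, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data (assumed): one predicted embedding and five candidate embeddings.
predicted = [0.9, 0.1, 0.0]
candidates = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
    [-1.0, 0.0, 0.0],
]

best_index, scores = select_object(predicted, candidates)
print("selected candidate:", best_index)
```

In this toy setup the second candidate is closest in direction to the predicted embedding, so it is the one selected. With five candidates, choosing uniformly at random would be correct 20% of the time, which gives a rough reference point for the reported accuracies.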
