Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions

Abstract

This paper studies the task of grasping any object from known categories by following free-form language instructions. This task demands techniques from computer vision, natural language processing, and robotics. We bring these disciplines together on this open challenge, which is essential to human-robot interaction. Critically, the key challenge lies in inferring the category of objects from linguistic instructions and accurately estimating the 6-DoF information of unseen objects from the known classes. In contrast, previous works focus on inferring the pose of object candidates at the instance level, which significantly limits their applications in real-world scenarios. In this paper, we propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention. To this end, we propose a novel two-stage method. In particular, the first stage grounds the target in the RGB image through language descriptions of names, attributes, and spatial relations of objects. The second stage extracts and segments point clouds from the cropped depth image and estimates the full 6-DoF object pose at the category level. In this manner, our approach can locate the specific object by following human instructions and estimate the full 6-DoF pose of a category-known but unseen instance that is not used for training the model. Extensive experimental results show that our method is competitive with the state-of-the-art language-conditioned grasp method. Importantly, we deploy our approach on a physical robot to validate the usability of our framework in real-world applications. Please refer to the supplementary material for the demo videos of our robot experiments.
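
To make the hand-off between the two stages concrete, the following is a minimal sketch of how a grounded 2D bounding box could be turned into a point cloud from the depth image. It assumes a standard pinhole camera model; the function name `depth_crop_to_points`, the box layout `(u_min, v_min, u_max, v_max)`, and the intrinsics `fx, fy, cx, cy` are illustrative assumptions and are not taken from the paper. Segmentation and category-level pose estimation are not shown.

```python
import numpy as np

def depth_crop_to_points(depth, bbox, fx, fy, cx, cy):
    """Back-project depth pixels inside a 2D box into camera-frame 3D points.

    Hypothetical helper: `bbox` is (u_min, v_min, u_max, v_max) in pixels,
    with the max bounds exclusive; `depth` holds metric depth, 0 = missing.
    """
    u_min, v_min, u_max, v_max = bbox
    crop = depth[v_min:v_max, u_min:u_max]

    # Pixel coordinates of the crop, expressed in the full image frame.
    v, u = np.mgrid[v_min:v_max, u_min:u_max]

    # Discard pixels with no depth reading.
    valid = crop > 0
    z = crop[valid]

    # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # shape (N, 3)
```

Under these assumptions, the returned `(N, 3)` array would be the input to the second stage, which segments the object points and estimates the category-level 6-DoF pose.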
