Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Abstract

6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones given the language input. Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language. Intensive experimental results show the effectiveness of our method in both benchmarking experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications. Our project is available at https://airvlab.github.io/grasp-anything.
