Abstract
General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from a lack of large-scale, reasoning-based affordance prediction data, which raises considerable concern about their open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with affordance maps, while the difficulty of the language instructions is substantially increased by removing category names and providing only functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.