SAGE: Bridging Semantic and Actionable Parts for GEneralizable Articulated-Object Manipulation under Language Instructions

Abstract

Generalizable manipulation of articulated objects remains a challenging problem in many real-world scenarios, given the diverse object structures, functionalities, and goals. In these tasks, both semantic interpretation and physical plausibility are crucial for a policy to succeed. To address this problem, we propose SAGE, a novel framework that bridges the understanding of semantic and actionable parts of articulated objects to achieve generalizable manipulation under language instructions. Given a manipulation goal specified in natural language, an instruction interpreter built on Large Language Models (LLMs) first translates it into programmatic actions on the object's semantic parts. This process also involves a scene context parser for understanding the visual inputs, which is designed to generate scene descriptions with both rich information and accurate interaction-related facts by combining generalist Visual-Language Models (VLMs) with domain-specialist part perception models. To further convert the action programs into executable policies, a part grounding module then maps the object semantic parts suggested by the instruction interpreter into so-called Generalizable Actionable Parts (GAParts). Finally, an interactive feedback module is incorporated to respond to failures, which greatly increases the robustness of the overall framework. Experiments both in simulation environments and on real robots show that our framework can handle a large variety of articulated objects with diverse language-instructed goals. We also provide a new benchmark for language-guided articulated-object manipulation in realistic scenarios.
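
To make the data flow described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes an action program is a list of (verb, semantic part) steps, stands in for the part grounding module with a plain lookup table, and models the feedback module as a simple retry loop. All names here (`Action`, `ground_part`, `run_program`, `flaky_execute`, and the label `line_fixed_handle`) are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """One step of a hypothetical action program: a verb applied to a semantic part."""
    verb: str            # e.g. "pull"
    semantic_part: str   # e.g. "drawer handle"


def ground_part(semantic_part: str, table: Dict[str, str]) -> Optional[str]:
    """Map a semantic part name to an actionable-part label.

    Assumption: a dictionary stands in for the part grounding module;
    a real system would rely on perception rather than a lookup.
    """
    return table.get(semantic_part)


def run_program(
    program: List[Action],
    table: Dict[str, str],
    execute: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Run each action, retrying on failure, and return a log of outcomes."""
    log: List[str] = []
    for action in program:
        part = ground_part(action.semantic_part, table)
        if part is None:
            # No actionable part found for this semantic part; skip the step.
            log.append(f"ungrounded:{action.semantic_part}")
            continue
        for attempt in range(max_retries + 1):
            if execute(action.verb, part):
                log.append(f"ok:{action.verb}:{part}:attempt{attempt}")
                break
        else:
            # Every attempt failed, so record the failure and move on.
            log.append(f"failed:{action.verb}:{part}")
    return log


# Toy executor (hypothetical): fails on its first call, succeeds afterwards.
calls = {"n": 0}


def flaky_execute(verb: str, part: str) -> bool:
    calls["n"] += 1
    return calls["n"] > 1


program = [Action("pull", "drawer handle"), Action("press", "power switch")]
table = {"drawer handle": "line_fixed_handle"}  # placeholder label
log = run_program(program, table, flaky_execute)
print(log)
```

In this toy run the first action succeeds on its second attempt, and the second action is reported as ungrounded because the table has no entry for it. Keeping grounding and execution as separate, swappable functions mirrors the modular structure the abstract describes, though the real modules are learned models rather than lookups.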
