Abstract
Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to translate the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions in seamless coordination with full-body movements. We also train a policy to track the generated motions in physics simulation via reinforcement learning (RL) to ensure the physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.