Abstract
Open-vocabulary generalization requires robotic systems to perform tasks involving complex and diverse environments and task goals. While the recent advances in vision language models (VLMs) present unprecedented opportunities to solve unseen problems, how to utilize their emergent capabilities to control robots in the physical world remains an open question. In this paper, we present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language descriptions. At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world. By prompting a VLM pre-trained on Internet-scale data, our approach predicts the affordances and generates the corresponding motions by leveraging the concept understanding and commonsense knowledge from broad sources. To scaffold the VLM's reasoning in zero-shot, we propose a visual prompting technique that annotates marks on the images, converting the prediction of keypoints and waypoints into a series of visual question answering problems that are feasible for the VLM to solve. Using the robot experiences collected in this way, we further investigate ways to bootstrap the performance through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions, such as tool use, deformable body manipulation, and object rearrangement.