Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

Abstract

Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.
