Abstract
We present ForceSight, a system for text-guided mobile manipulation that predicts visual-force goals using a deep neural network. Given a single RGBD image combined with a text prompt, ForceSight determines a target end-effector pose in the camera frame (kinematic goal) and the associated forces (force goal). Together, these two components form a visual-force goal. Prior work has demonstrated that deep models outputting human-interpretable kinematic goals can enable dexterous manipulation by real robots. Forces are critical to manipulation, yet have typically been relegated to lower-level execution in these systems. When deployed on a mobile manipulator equipped with an eye-in-hand RGBD camera, ForceSight performed tasks such as precision grasps, drawer opening, and object handovers with an 81% success rate in unseen environments with object instances that differed significantly from the training data. In a separate experiment, relying exclusively on visual servoing and ignoring force goals dropped the success rate from 90% to 45%, demonstrating that force goals can significantly enhance performance. The appendix, videos, code, and trained models are available at https://force-sight.github.io/.
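
As a rough illustration of the visual-force goal described above, the sketch below pairs a kinematic target with a force target and checks whether both are satisfied. It is not taken from the ForceSight codebase: the names `VisualForceGoal` and `goal_reached`, the tolerance values, and the decision to represent the pose by position only (orientation is omitted for brevity) are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import dist
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class VisualForceGoal:
    """Hypothetical container pairing a kinematic goal with a force goal."""
    position_cam: Vec3  # target end-effector position in the camera frame (m)
    force: Vec3         # target force at the end effector (N)


def goal_reached(
    goal: VisualForceGoal,
    position_cam: Vec3,
    force: Vec3,
    pos_tol: float = 0.01,    # assumed position tolerance (m)
    force_tol: float = 1.0,   # assumed force tolerance (N)
    use_force: bool = True,
) -> bool:
    """Return True when the measured state is within tolerance of the goal.

    With use_force=False only the kinematic goal is checked, which mimics
    a vision-only controller that ignores force goals.
    """
    position_ok = dist(position_cam, goal.position_cam) <= pos_tol
    if not use_force:
        return position_ok
    force_ok = dist(force, goal.force) <= force_tol
    return position_ok and force_ok


# Illustrative values only: the gripper is at the target position
# but is not yet applying the target force.
goal = VisualForceGoal(position_cam=(0.0, 0.0, 0.30), force=(0.0, 0.0, 5.0))
measured_position = (0.0, 0.0, 0.305)
measured_force = (0.0, 0.0, 0.0)

vision_only_done = goal_reached(goal, measured_position, measured_force, use_force=False)
visual_force_done = goal_reached(goal, measured_position, measured_force)
```

In this toy case the vision-only check reports completion while the visual-force check does not, which is one way to picture why ignoring force goals might end an action such as a grasp or a drawer pull prematurely.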