Abstract
Data-driven approaches struggle with precise manipulation; imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. Local policies trained once in a canonical setting can therefore generalize via a localize-then-execute strategy. ViTaL achieves around 90% success on contact-rich tasks in unseen environments and is robust to distractors. ViTaL's effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills. Results and videos are available at https://vitalprecise.github.io.