BiFold: Bimanual Cloth Folding with Language Guidance

Abstract

Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.
