Abstract
Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift, and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors, highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.