Abstract
Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but the diverse morphologies and action spaces of different robot embodiments make unified training difficult. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. Extensive evaluations, spanning simulation and real-world experiments, single-task and multitask scenarios, few-shot learning, and robustness assessments, demonstrate that H-RDT outperforms both training from scratch and existing state-of-the-art methods, including Pi0 and RDT. Relative to training from scratch, H-RDT improves by 13.9% in simulation and 40.5% in real-world experiments. These results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.
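
For readers unfamiliar with flow matching, the following is a minimal NumPy sketch of a generic conditional flow matching objective and sampler. It is not H-RDT's implementation: the abstract does not specify the probability path or the sampler, so the linear noise-to-data path, the Euler integrator, the flat (batch, action_dim) action layout, and the names `flow_matching_loss`, `sample_actions`, and `velocity_fn` are all assumptions made for illustration.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions, rng):
    """Conditional flow matching loss on a linear noise-to-data path (assumed).

    velocity_fn: hypothetical model, maps (x_t, t) -> predicted velocity.
    actions:     array of shape (batch, action_dim), the target actions.
    """
    noise = rng.standard_normal(actions.shape)    # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))   # one time in [0, 1) per sample
    x_t = (1.0 - t) * noise + t * actions         # point on the straight path
    target = actions - noise                      # velocity of that path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


def sample_actions(velocity_fn, shape, rng, steps=10):
    """Generate actions by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal(shape)                # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In a real policy, `velocity_fn` would be a learned network conditioned on observations; here it is left as a plain callable so that the objective and the sampler can be read in isolation.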