Learning from Massive Human Videos for Universal Humanoid Pose Control

Abstract

Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting of humans to humanoid robots, and policy learning for real-world deployment. With Humanoid-X, we further train a large humanoid model, UH-1, which takes text instructions as input and outputs corresponding actions to control a humanoid robot. Extensive simulated and real-world experiments validate that our scalable training approach leads to superior generalization in text-based humanoid control, marking a significant step toward adaptable, real-world-ready humanoid robots.
