Abstract
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of human videos lies not only in their scale but, more importantly, in the richness of their scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA on the Ego Humanoid Manipulation Benchmark, show significant improvements over baselines, and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA