Abstract
The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/