Abstract
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/