Abstract
Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques in point tracking, dynamic masking, and structure-from-motion to achieve improvements over state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.