Abstract
Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.