Depth Any Video with Scalable Synthetic Data

Abstract

Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
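To put the stated dataset scale in perspective, the short sketch below works out the total annotated footage implied by 40,000 clips of 5 seconds each. The function names and the frame rate passed to `frames_to_seconds` are illustrative assumptions; the abstract does not state the capture or inference frame rate.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# Only num_clips=40_000, clip_seconds=5 and max_frames=150 come from the text;
# everything else (names, the example fps) is a hypothetical illustration.

def dataset_duration(num_clips: int, clip_seconds: float) -> dict:
    """Return the total footage length in seconds and hours."""
    total_seconds = num_clips * clip_seconds
    return {"seconds": total_seconds, "hours": total_seconds / 3600}


def frames_to_seconds(num_frames: int, fps: float) -> float:
    """Convert a frame count to wall-clock seconds at an assumed frame rate."""
    return num_frames / fps


stats = dataset_duration(num_clips=40_000, clip_seconds=5)
print(f"total footage: {stats['seconds']:,} s (about {stats['hours']:.1f} h)")

# ASSUMPTION: 30 fps is a placeholder, not a value given in the abstract.
assumed_fps = 30
window = frames_to_seconds(150, assumed_fps)
print(f"150 frames at an assumed {assumed_fps} fps span {window:.1f} s")
```

Under these figures the synthetic set amounts to 200,000 seconds of depth-annotated video. How many frames that represents, and how long a 150-frame inference window lasts, depends on a frame rate the abstract leaves unspecified.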
