Abstract
A versatile video depth estimation model should (1) be accurate andconsistent across frames, (2) produce high-resolution depth maps, and (3)support real-time streaming. We propose FlashDepth, a method that satisfies allthree requirements, performing depth estimation on a 2044x1148 streaming videoat 24 FPS. We show that, with careful modifications to pretrained single-imagedepth models, these capabilities are enabled with relatively little data andtraining. We evaluate our approach across multiple unseen datasets againststate-of-the-art depth models, and find that ours outperforms them in terms ofboundary sharpness and speed by a significant margin, while maintainingcompetitive accuracy. We hope our model will enable various applications thatrequire high-resolution depth, such as video editing, and onlinedecision-making, such as robotics.