NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Abstract

Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing a superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.
