Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models

Abstract

3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use, playing a crucial role in movies, games, AR, and VR. However, creating temporally consistent and realistic textures for mesh sequences remains labor-intensive for professional artists. On the other hand, while video diffusion models excel at text-driven video generation, they often lack 3D geometry awareness and struggle with achieving multi-view consistent texturing for 3D meshes. In this work, we present Tex4D, a zero-shot approach that integrates inherent 3D geometry knowledge from mesh sequences with the expressiveness of video diffusion models to produce multi-view and temporally consistent 4D textures. Given an untextured mesh sequence and a text prompt as inputs, our method enhances multi-view consistency by synchronizing the diffusion process across different views through latent aggregation in the UV space. To ensure temporal consistency, we leverage prior knowledge from a conditional video generation model for texture synthesis. However, straightforwardly combining the video diffusion model and the UV texture aggregation leads to blurry results. We analyze the underlying causes and propose a simple yet effective modification to the DDIM sampling process to address this issue. Additionally, we introduce a reference latent texture to strengthen the correlation between frames during the denoising process. To the best of our knowledge, Tex4D is the first method specifically designed for 4D scene texturing. Extensive experiments demonstrate its superiority in producing multi-view and multi-frame consistent videos based on untextured mesh sequences.
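
To make the idea of latent aggregation in the UV space concrete, the following is a minimal sketch and not the authors' implementation. It assumes each view's latent pixel has already been assigned a flattened texel index by rasterizing the mesh, and it uses plain averaging over views; the function names, array layouts, and the averaging rule are all illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np


def aggregate_latents_in_uv(view_latents, uv_indices, masks, uv_size):
    """Average per-view latents into one shared UV latent texture.

    Hypothetical layout (an assumption, not taken from the paper):
      view_latents: (V, H, W, C) float latents for V camera views
      uv_indices:   (V, H, W) int, flattened texel index in [0, uv_size)
      masks:        (V, H, W) bool, True where the pixel sees the mesh
    Returns the (uv_size, C) texture and a (uv_size,) coverage mask.
    """
    channels = view_latents.shape[-1]
    texture_sum = np.zeros((uv_size, channels), dtype=np.float64)
    texel_count = np.zeros(uv_size, dtype=np.float64)

    texel_ids = uv_indices[masks]   # (N,) texel hit by each visible pixel
    values = view_latents[masks]    # (N, C) latent at each visible pixel

    # Unbuffered scatter-add so texels seen by several pixels accumulate.
    np.add.at(texture_sum, texel_ids, values)
    np.add.at(texel_count, texel_ids, 1.0)

    covered = texel_count > 0
    texture = np.zeros_like(texture_sum)
    texture[covered] = texture_sum[covered] / texel_count[covered, None]
    return texture, covered


def render_from_uv(texture, uv_indices, masks, background):
    """Project the shared texture back to every view (nearest-texel lookup)."""
    synced = background.astype(np.float64).copy()
    synced[masks] = texture[uv_indices[masks]]
    return synced
```

In a sampler, a step of this kind would run between denoising iterations: the per-view latents are aggregated into the texture and then re-rendered, so that pixels in different views that map to the same texel carry the same latent value.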
