Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Abstract

This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods address the problem of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the videos generated by these models often lack spatio-temporal consistency, which degrades view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping GPU memory consumption affordable. Experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method synthesizes high-quality and consistent novel-view videos and significantly outperforms existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/
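To make the alternating sliding-window schedule concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a grid of scalar "latents" indexed by view and timestamp, and it replaces the diffusion model's denoising step with a hypothetical stand-in (`toy_denoise`) that pulls each value toward its window mean. The function names, window size, stride, and step count are all illustrative assumptions. The abstract does not specify them, and the conditioning on camera pose and human pose is omitted entirely.

```python
def window_starts(length, window, stride):
    """Start indices of sliding windows that together cover range(length)."""
    if window >= length:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # make sure the tail is covered
    return starts


def toy_denoise(values, strength=0.5):
    """Hypothetical stand-in for one denoising call on a window of latents.

    It only mixes information inside the window by moving each value
    toward the window mean; a real model would run a diffusion step here.
    """
    mean = sum(values) / len(values)
    return [v + strength * (mean - v) for v in values]


def sliding_iterative_denoise(grid, steps=4, window=4, stride=2):
    """Alternate spatial and temporal sliding-window passes over the grid.

    grid[v][t] is the (scalar, toy) latent for view v at timestamp t.
    Only `window` latents are processed per call, which is what keeps
    the per-call memory footprint bounded in this sketch.
    """
    grid = [row[:] for row in grid]  # do not mutate the caller's grid
    n_views, n_times = len(grid), len(grid[0])
    for step in range(steps):
        if step % 2 == 0:
            # Spatial pass: slide across views at each fixed timestamp.
            for t in range(n_times):
                for s in window_starts(n_views, window, stride):
                    end = min(s + window, n_views)
                    new = toy_denoise([grid[v][t] for v in range(s, end)])
                    for v, value in zip(range(s, end), new):
                        grid[v][t] = value
        else:
            # Temporal pass: slide across timestamps for each fixed view.
            for v in range(n_views):
                for s in window_starts(n_times, window, stride):
                    end = min(s + window, n_times)
                    grid[v][s:end] = toy_denoise(grid[v][s:end])
    return grid


# Usage: a single non-zero latent at (view 0, time 0) in an 8 x 6 grid.
impulse = [[0.0] * 6 for _ in range(8)]
impulse[0][0] = 1.0
result = sliding_iterative_denoise(impulse, steps=2)
```

In this toy setting, one spatial pass followed by one temporal pass is enough for the single non-zero latent to influence every cell of the grid, even though each call only sees four latents at a time. That mirrors, in a very simplified form, the abstract's point that iterative sliding yields a large receptive field at bounded memory cost.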
