Abstract
Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where the camera-character distance varies. This limits applications such as cinematic shot-type planning and camera control. We propose a pose-correlated reference selection diffusion network that supports substantial viewpoint variations in human animation. Our key idea is to enable the network to use multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To reduce the computational cost that the additional references incur, we first introduce a novel pose correlation module to compute similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy that uses the attention map to identify key regions for animation generation. To train our model, we curated a large dataset from public TED talks featuring varied shots of the same character, helping the model learn synthesis for different perspectives. Our experimental results show that, with the same number of reference images, our model performs favorably compared to current SOTA methods under large viewpoint changes. We further show that the adaptive reference selection is able to choose the most relevant reference regions to generate humans under free viewpoints.
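
As a rough illustration of the selection idea described above, the following sketch computes an attention map between target-pose and reference-pose features and keeps the reference tokens that receive the most attention. It is not the network proposed here: the function names (`pose_correlation`, `select_reference_tokens`), the use of plain scaled dot-product attention, the top-k rule, and the toy 2-D features are all assumptions made for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def pose_correlation(target_feats, ref_feats):
    # Hypothetical stand-in for a pose correlation module:
    # scaled dot-product attention, rows = target tokens, columns = reference tokens.
    scale = 1.0 / math.sqrt(len(target_feats[0]))
    attn = []
    for q in target_feats:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ref_feats]
        attn.append(softmax(scores))
    return attn

def select_reference_tokens(attn, k):
    # Assumed selection rule: keep the k reference tokens with the
    # largest attention mass summed over all target tokens.
    n_ref = len(attn[0])
    relevance = [sum(row[j] for row in attn) for j in range(n_ref)]
    ranked = sorted(range(n_ref), key=lambda j: relevance[j], reverse=True)
    return sorted(ranked[:k])

# Toy features: two target-pose tokens and four reference-pose tokens.
target = [[1.0, 0.0], [0.9, 0.1]]
refs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.2]]

attn = pose_correlation(target, refs)
keep = select_reference_tokens(attn, k=2)
print(keep)  # [0, 3]
```

In this toy case, reference tokens 0 and 3 point in nearly the same direction as the target tokens, so they receive the most attention and are retained, while the remaining tokens would be dropped before the more expensive generation step.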