Abstract
Video try-on is a challenging task that has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling coherent motions simultaneously. To address these difficulties, we propose a diffusion-based framework for video try-on named "Tunnel Try-on." The core idea is to excavate a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on preserves the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
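
As a rough illustration of how a Kalman filter can smooth per-frame crop boxes, the sketch below runs an independent scalar filter on each box coordinate. It is only a minimal sketch under stated assumptions: the random-walk state model, the function name `kalman_smooth_boxes`, and the `process_var` and `measurement_var` parameters are hypothetical choices made for this example and are not taken from the paper, whose actual filter design is not described in this abstract.

```python
def kalman_smooth_boxes(boxes, process_var=1e-2, measurement_var=1.0):
    """Smooth a sequence of crop boxes, one (x_min, y_min, x_max, y_max) per frame.

    Assumption: each coordinate follows a random walk and is filtered
    independently. This is an illustrative model, not the paper's.
    """
    if not boxes:
        return []

    # Initialise the estimate with the first observed box.
    state = [float(v) for v in boxes[0]]
    cov = [1.0] * 4  # per-coordinate estimate variance
    smoothed = [tuple(state)]

    for box in boxes[1:]:
        for i, z in enumerate(box):
            # Predict: the state is carried over, uncertainty grows.
            cov[i] += process_var
            # Update: blend the prediction with the new measurement z.
            gain = cov[i] / (cov[i] + measurement_var)
            state[i] += gain * (z - state[i])
            cov[i] *= 1.0 - gain
        smoothed.append(tuple(state))

    return smoothed


def total_variation(values):
    """Sum of absolute frame-to-frame changes, used here as a jitter measure."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical jittery detections: x_min wobbles while the rest stays fixed.
raw = [(10, 10, 50, 50), (14, 10, 50, 50), (6, 10, 50, 50), (12, 10, 50, 50)]
smooth = kalman_smooth_boxes(raw)
print(total_variation([b[0] for b in raw]), total_variation([b[0] for b in smooth]))
```

A larger `measurement_var` relative to `process_var` makes the filter trust its running estimate more, which yields steadier crops at the cost of a slower response to real motion.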