Improving the Diffusability of Autoencoders

Abstract

Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256.
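
To make the idea of decoder scale equivariance concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the regularizer compares the decoding of a downsampled latent against the equally downsampled RGB target. The 2x average-pooling operator, the toy nearest-neighbor decoder, and all function names are hypothetical choices made for illustration only.

```python
import numpy as np

def downsample2x(x):
    """Average-pool the last two axes by 2 (assumed low-pass + subsample operator)."""
    *lead, h, w = x.shape
    return x.reshape(*lead, h // 2, 2, w // 2, 2).mean(axis=(-3, -1))

def toy_decode(z):
    """Hypothetical stand-in decoder: keep 3 channels, upsample 2x by repetition."""
    rgb = z[:3]
    return np.repeat(np.repeat(rgb, 2, axis=-2), 2, axis=-1)

def scale_equivariance_loss(decode, z, x, down=downsample2x):
    """MSE between decode(down(z)) and down(x).

    The loss is small when decoding a coarser latent yields the coarser image,
    i.e. when latent and RGB spaces agree across scales.
    """
    x_low = down(x)
    x_hat_low = decode(down(z))
    return float(np.mean((x_hat_low - x_low) ** 2))

# A smooth (constant) latent versus a latent dominated by the highest frequency.
smooth_z = np.ones((4, 8, 8))
board = (np.indices((8, 8)).sum(axis=0) % 2) * 2.0 - 1.0  # +/-1 checkerboard
noisy_z = np.tile(board, (4, 1, 1))

loss_smooth = scale_equivariance_loss(toy_decode, smooth_z, toy_decode(smooth_z))
loss_noisy = scale_equivariance_loss(toy_decode, noisy_z, toy_decode(noisy_z))
print(loss_smooth, loss_noisy)
```

With this toy decoder, the loss reduces to the energy that the latent loses under downsampling, so the checkerboard latent is penalized while the constant latent is not. In an actual fine-tuning setup, a term of this kind would presumably be added to the usual reconstruction objective; the weighting and the exact downsampling scheme are not specified in the abstract.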
