Abstract
Deep generative models have shown impressive results in generating realistic images of faces. GANs can generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and are able to generate diverse samples given the same condition. In this paper, we propose a multi-conditioning approach for diffusion models via cross-attention, exploiting both attributes and semantic masks to generate high-quality and controllable face images. We also study the impact of applying perceptual-focused loss weighting in the latent space instead of the pixel space. Our method extends previous approaches by conditioning on more than one set of features, providing more fine-grained control over the generated face images. We evaluate our approach on the CelebA-HQ dataset and show that it can generate realistic and diverse samples while allowing for fine-grained control over multiple attributes and semantic regions. Additionally, we perform an ablation study to evaluate the impact of different conditioning strategies on the quality and diversity of the generated images.