Abstract
We present StyleFusion, a new mapping architecture for StyleGAN, which takes as input a number of latent codes and fuses them into a single style code. Inserting the resulting style code into a pre-trained StyleGAN generator results in a single harmonized image in which each semantic region is controlled by one of the input latent codes. Effectively, StyleFusion yields a disentangled representation of the image, providing fine-grained control over each region of the generated image. Moreover, to facilitate global control over the generated image, a special input latent code is incorporated into the fused representation. StyleFusion operates in a hierarchical manner, where each level is tasked with learning to disentangle a pair of image regions (e.g., the car body and wheels). The resulting learned disentanglement allows one to modify both local, fine-grained semantics (e.g., facial features) as well as more global features (e.g., pose and background), providing improved flexibility in the synthesis process. As a natural extension, StyleFusion enables one to perform semantically aware cross-image mixing of regions that are not necessarily aligned. Finally, we demonstrate how StyleFusion can be paired with existing editing techniques to more faithfully constrain the edit to the user's region of interest.