Abstract
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework (a pivotal component in tackling this task), our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model, which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and it can generate high-resolution results, e.g., in HD. We demonstrate high-quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose, and appearance.