Abstract
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks, which allows text descriptions to determine the visual attributes of synthetic images. We propose four key components: (1) the implementation of part-of-speech tagging to filter out non-semantic words in the given description, (2) the adoption of an affine combination module to effectively fuse text and image features from different modalities, (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators, and (4) a new structure loss to further improve discriminators to better distinguish real and synthetic images. Extensive experiments on the COCO dataset demonstrate that our method achieves superior performance in both visual realism and semantic consistency with the given descriptions.
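As an illustration of component (1), the following is a minimal sketch of filtering a description by part of speech. It is not the implementation used in this paper: the lookup table `TOY_POS_LEXICON` is a hypothetical stand-in for a real tagger, the helper names are invented for this example, and the choice to keep only nouns and adjectives is an assumption about which words count as semantic.

```python
# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS_LEXICON = {
    "a": "DET", "the": "DET",
    "is": "VERB", "parked": "VERB",
    "on": "ADP", "in": "ADP", "of": "ADP",
    "and": "CCONJ",
    "red": "ADJ", "green": "ADJ", "large": "ADJ",
    "bus": "NOUN", "street": "NOUN", "field": "NOUN",
}

# Assumption: nouns and adjectives describe the visual attributes to keep.
KEPT_TAGS = {"NOUN", "ADJ"}


def tag(tokens):
    """Pair each token with a tag; words missing from the lexicon get 'X'."""
    return [(tok, TOY_POS_LEXICON.get(tok.lower(), "X")) for tok in tokens]


def filter_description(description, kept_tags=KEPT_TAGS):
    """Return only the words whose tag is in kept_tags, in original order."""
    # Crude tokenization: drop periods and commas, then split on whitespace.
    tokens = description.replace(".", " ").replace(",", " ").split()
    return [tok for tok, pos in tag(tokens) if pos in kept_tags]


print(filter_description("A red bus is parked on the street."))
# prints ['red', 'bus', 'street']
```

In practice the lexicon lookup would be replaced by a trained tagger, and the set of retained tags would follow whatever the method actually specifies.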