MirrorGAN: Learning Text-to-image Generation by Redescription

Abstract

Generating an image from a given text description has two goals: visual realism and semantic consistency. Although significant progress has been made in generating high-quality and visually realistic images using generative adversarial networks, guaranteeing semantic consistency between the text description and visual content remains very challenging. In this paper, we address this problem by proposing a novel global-local attentive and semantic-preserving text-to-image-to-text framework called MirrorGAN. MirrorGAN exploits the idea of learning text-to-image generation by redescription and consists of three modules: a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM). STEM generates word- and sentence-level embeddings. GLAM has a cascaded architecture for generating target images from coarse to fine scales, leveraging both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM seeks to regenerate the text description from the generated image, which semantically aligns with the given text description. Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods.
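
The text-to-image-to-text data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_stem`, `toy_glam`, `redescribe`, the lambda captioner, the image sizes, and the word-overlap score are all hypothetical placeholders chosen only to show how the three modules hand data to one another. In particular, captioning only the finest-scale image and scoring alignment by word overlap are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Image = List[List[float]]  # toy stand-in for an image: a square grid of intensities


def toy_stem(text: str) -> Tuple[List[Vector], Vector]:
    """Hypothetical STEM stand-in: word-level and sentence-level embeddings."""
    words = text.lower().split()
    if not words:
        raise ValueError("text must contain at least one word")
    # Placeholder one-dimensional embedding per word (scaled word length).
    word_vecs = [[len(w) / 10.0] for w in words]
    # Placeholder sentence embedding: the mean of the word embeddings.
    sent_vec = [sum(v[0] for v in word_vecs) / len(word_vecs)]
    return word_vecs, sent_vec


def toy_glam(word_vecs: List[Vector], sent_vec: Vector,
             sizes: Tuple[int, ...] = (4, 8, 16)) -> List[Image]:
    """Hypothetical GLAM stand-in: one image per scale, coarse to fine."""
    local = max(v[0] for v in word_vecs)  # placeholder for local word attention
    glob = sent_vec[0]                    # placeholder for global sentence attention
    pixel = 0.5 * (local + glob)
    return [[[pixel] * s for _ in range(s)] for s in sizes]


def redescribe(text: str, stem: Callable, glam: Callable, stream: Callable):
    """Run text -> images -> regenerated text and score the word overlap."""
    word_vecs, sent_vec = stem(text)
    images = glam(word_vecs, sent_vec)
    regenerated = stream(images[-1])      # assumption: caption the finest image
    original = set(text.lower().split())
    # Illustrative alignment score: fraction of original words regenerated.
    overlap = len(original & set(regenerated.lower().split())) / len(original)
    return images, regenerated, overlap


images, caption, overlap = redescribe(
    "a small red bird",
    toy_stem,
    toy_glam,
    stream=lambda image: "a red bird",    # placeholder for a captioning model
)
print([len(img) for img in images], caption, overlap)
```

In a trained system each placeholder would be a learned network, and the alignment signal from the regenerated text would feed back into training the image generator rather than being printed.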
