M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

Abstract

Generative adversarial networks have led to significant advances in cross-modal/domain translation. However, typically these networks are designed for a specific task (e.g., dialogue generation or image synthesis, but not both). We present a unified model, M3D-GAN, that can translate across a wide range of modalities (e.g., text, image, and speech) and domains (e.g., attributes in images or emotions in speech). Our model consists of modality subnets that convert data from different modalities into unified representations, and a unified computing body where data from different modalities share the same network architecture. We introduce a universal attention module that is jointly trained with the whole network and learns to encode a large range of domain information into a highly structured latent space. We use this to control synthesis in novel ways, such as producing diverse realistic pictures from a sketch or varying the emotion of synthesized speech. We evaluate our approach on extensive benchmark tasks, including image-to-image, text-to-image, image captioning, text-to-speech, speech recognition, and machine translation. Our results show state-of-the-art performance on some of the tasks.
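
The split between modality subnets and a shared computing body can be illustrated with a toy sketch. This is not the paper's implementation: the linear maps, the dimensions, and the names `modality_subnets`, `shared_body`, and `encode` are all hypothetical, and the universal attention module and adversarial training are omitted. It only shows inputs of different sizes being mapped into one representation size and then passed through the same body.

```python
import random

SHARED_DIM = 8  # hypothetical size of the unified representation


def make_linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


rng = random.Random(0)

# One subnet per modality; the input sizes are illustrative assumptions.
modality_subnets = {
    "text": make_linear(16, SHARED_DIM, rng),
    "image": make_linear(64, SHARED_DIM, rng),
    "speech": make_linear(32, SHARED_DIM, rng),
}

# A single body whose weights are reused for every modality.
shared_body = make_linear(SHARED_DIM, SHARED_DIM, rng)


def encode(modality, features):
    """Map modality-specific features to the unified size, then apply the shared body."""
    unified = apply_linear(modality_subnets[modality], features)
    return apply_linear(shared_body, unified)


text_out = encode("text", [1.0] * 16)
image_out = encode("image", [1.0] * 64)
```

In this sketch both `text_out` and `image_out` have length `SHARED_DIM`, which is the property that lets one body serve all modalities.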
