Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Abstract

Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn more useful, generalisable representations that faithfully capture common underlying factors between the modalities. In this work, we characterise successful learning of such models as the fulfillment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration. Here, we propose a mixture-of-experts multimodal variational autoencoder (MMVAE) to learn generative models on different sets of modalities, including a challenging image-language dataset, and demonstrate its ability to satisfy all four criteria, both qualitatively and quantitatively.
