MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Abstract

Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
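
As a rough illustration of the variable masking ratio idea in the abstract, the following Python sketch masks a sequence of discrete token ids at a ratio drawn anew for each sample. It is not the authors' implementation: the uniform sampling distribution, the range [0.5, 1.0], the 256-token sequence length, the 1024-entry codebook, and the names `MASK_ID`, `sample_ratio`, and `mask_tokens` are all assumptions made for the example.

```python
import random

MASK_ID = -1  # hypothetical id standing in for a learned mask token


def sample_ratio(rng, low=0.5, high=1.0):
    """Draw a masking ratio for one training sample.

    Assumption: a uniform draw over [low, high]. The abstract only says
    the ratio is variable; it does not specify the distribution.
    """
    return rng.uniform(low, high)


def mask_tokens(tokens, ratio, rng):
    """Replace roughly `ratio` of the tokens with MASK_ID.

    Returns the masked copy and the sorted list of masked positions,
    which are the positions a model would be trained to reconstruct.
    """
    n = len(tokens)
    k = max(1, min(n, round(ratio * n)))  # mask at least one token
    positions = sorted(rng.sample(range(n), k))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


rng = random.Random(0)
# Stand-in for the token ids a vector-quantized tokenizer would produce
# for one image (hypothetical sizes: 256 tokens, codebook of 1024).
tokens = [rng.randrange(1024) for _ in range(256)]

ratio = sample_ratio(rng)
masked, positions = mask_tokens(tokens, ratio, rng)
print(f"ratio={ratio:.2f}, masked {len(positions)} of {len(tokens)} tokens")
```

Under this sketch, draws near 1.0 leave almost no visible tokens, which corresponds to the generative regime the abstract describes, while draws near the lower bound leave more context, closer to the representation learning regime.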
