Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

  • 2024-09-25 18:58:21
  • Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

Abstract

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.

 
