Abstract
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.