Abstract
Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.