Abstract
Numerous efforts have been made to extend the ``next token prediction'' paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or to information loss incurred during the discretization process. Probably because of this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially improve the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.