Abstract
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior approaches, such as Chameleon, often rely on a single visual encoder for both tasks. However, because multimodal understanding and generation require different levels of information granularity, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.