FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

Abstract

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.
