MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Abstract

Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods inevitably suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone model's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates far superior performance to other joint multi-modal models, matching the method that employs a pretrained CLIP vision encoder, while also being able to generate high-quality images. We also show that our method scales with larger data and model sizes.