Abstract
Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these respects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multimodal benchmarks such as MMMU, while offering unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.