VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Abstract

VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: first, the unified vision tower aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; second, autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
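To make the idea of a single next-token prediction framework over text and discrete visual tokens more concrete, here is a minimal sketch in plain Python. It is not VILA-U's implementation: the vocabulary sizes, the `IMG_START` and `IMG_END` markers, and the helper names are all hypothetical, chosen only to illustrate how visual codebook indices could share one token space with text so that a single autoregressive objective covers both.

```python
# Illustrative sketch only. All sizes, special tokens, and function names
# below are assumptions for this example, not details taken from the paper.

TEXT_VOCAB_SIZE = 1000        # hypothetical text vocabulary size
VISUAL_CODEBOOK_SIZE = 256    # hypothetical number of discrete visual codes
IMG_START = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE  # hypothetical image-begin marker
IMG_END = IMG_START + 1                              # hypothetical image-end marker


def visual_to_unified(code):
    """Map a visual codebook index into the shared token space."""
    if not 0 <= code < VISUAL_CODEBOOK_SIZE:
        raise ValueError(f"visual code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_sequence(text_ids, visual_codes):
    """Concatenate text tokens and visual tokens into one token sequence."""
    visual_ids = [visual_to_unified(c) for c in visual_codes]
    return list(text_ids) + [IMG_START] + visual_ids + [IMG_END]


def next_token_pairs(sequence):
    """Return (context, target) pairs for next-token prediction.

    Each target is the token that follows its context, regardless of
    whether that token is textual or visual.
    """
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


if __name__ == "__main__":
    seq = build_sequence(text_ids=[5, 17, 42], visual_codes=[3, 200])
    print(seq)
    for context, target in next_token_pairs(seq):
        print(context, "->", target)
```

In this toy setup, the same prediction targets are produced for text and image positions alike, which is the sense in which a fully token-based autoregressive model needs no separate generation module.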
