$VILA^2$: VILA Augmented VILA

Abstract

Visual language models (VLMs) have progressed rapidly, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance rapidly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either crawls more raw data from the Internet, with no guarantee of data quality, or distills from black-box commercial models (e.g., GPT-4V / Gemini), which bounds performance by that of the source model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, and is then retrained from scratch on this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs, finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce $VILA^2$ (VILA-augmented-VILA), a VLM family that consistently improves accuracy on a wide range of tasks over prior art and achieves new state-of-the-art results on the MMMU leaderboard among open-source models.
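
The two-stage loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `recaption`, `self_augment`, `specialist_augment`, the `train` and `evaluate` callables, the `max_rounds` and `min_gain` parameters, and the choice to append specialist captions to existing ones are all hypothetical. A model is represented simply as a function that maps an image id and its old caption to a new caption.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]            # (image_id, caption)
Model = Callable[[str, str], str]   # hypothetical: (image_id, old_caption) -> new caption


def recaption(model: Model, data: List[Sample]) -> List[Sample]:
    """Rewrite every caption with the given model, keeping image ids."""
    return [(img, model(img, cap)) for img, cap in data]


def self_augment(
    data: List[Sample],
    train: Callable[[List[Sample]], Model],
    evaluate: Callable[[Model], float],
    max_rounds: int = 3,
    min_gain: float = 1e-6,
) -> Tuple[Model, List[Sample]]:
    """Recaption and retrain from scratch until the score stops improving."""
    best_model = train(data)
    best_score = evaluate(best_model)
    best_data = data
    for _ in range(max_rounds):
        candidate_data = recaption(best_model, best_data)
        candidate = train(candidate_data)       # retrain from scratch
        score = evaluate(candidate)
        if score - best_score < min_gain:       # assumed saturation test
            break
        best_model, best_score, best_data = candidate, score, candidate_data
    return best_model, best_data


def specialist_augment(
    data: List[Sample],
    specialists: List[Model],
    train: Callable[[List[Sample]], Model],
) -> Tuple[Model, List[Sample]]:
    """Add each specialist's task-oriented caption, then retrain the generalist.

    Assumption: specialist output is appended to the existing caption.
    """
    for specialist in specialists:
        data = [(img, cap + " " + specialist(img, cap)) for img, cap in data]
    return train(data), data
```

In this sketch the specialists are passed in as ready-made callables; in the approach described above they would themselves be finetuned from the self-augmented model before being used for recaptioning.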
