Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Abstract

Typical large vision-language models (LVLMs) apply autoregressive supervisionsolely to textual sequences, without fully incorporating the visual modalityinto the learning process. This results in three key limitations: (1) aninability to utilize images without accompanying captions, (2) the risk thatcaptions omit critical visual details, and (3) the challenge that certainvision-centric content cannot be adequately conveyed through text. As a result,current LVLMs often prioritize vision-to-language alignment while potentiallyoverlooking fine-grained visual information. While some prior works haveexplored autoregressive image generation, effectively leveraging autoregressivevisual supervision to enhance image understanding remains an open challenge. Inthis paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR),which enables joint learning of visual and textual modalities within a unifiedautoregressive framework. We show that autoregressively reconstructing the rawvisual appearance of images does not enhance and may even impair multimodalunderstanding. In contrast, autoregressively reconstructing the semanticrepresentation of images consistently improves comprehension. Notably, we findthat even when models are given continuous image features as input, they caneffectively reconstruct discrete semantic tokens, resulting in stable andconsistent improvements across a wide range of multimodal understandingbenchmarks. Our approach delivers significant performance gains across varyingdata scales (556k-2M) and types of LLM bacbones. Specifically, ASVR improvesLLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code isavailable at https://github.com/AlenjandroWang/ASVR.

Quick Read (beta)

loading the full paper ...