Unveiling Encoder-Free Vision-Language Models

Abstract

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for vision-language tasks. However, vision encoders impose a strong inductive bias on the abstracted visual representation, e.g., resolution, aspect ratio, and semantic priors, which can impede the flexibility and efficiency of VLMs. Training pure VLMs that accept vision and language inputs seamlessly, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) bridging vision-language representation inside one unified decoder; (2) enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, using only 35M publicly accessible data, EVE rivals encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B, whose training procedures and training data are undisclosed. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
