Abstract
The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin. Model available at https://huggingface.co/OpenGVLab/HoVLE.
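
As an illustration of the distillation stage described above, the following minimal sketch computes a per-position loss in which image positions are matched against vision-encoder features and text positions against LLM text embeddings. The abstract does not specify the objective, so the cosine-distance loss is an assumption, and the names `cosine_distance` and `distillation_loss` are hypothetical and not taken from the released model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distillation_loss(student_out, image_targets, text_targets):
    """Average cosine distance between student embeddings and teacher targets.

    student_out: list of (modality, vector) pairs, standing in for the output
        of a holistic embedding module; modality is "image" or "text".
    image_targets: teacher vectors (assumed to come from a pre-trained vision
        encoder), one per image position, in order.
    text_targets: teacher vectors (assumed to come from the LLM's text
        embedding table), one per text position, in order.
    """
    image_iter = iter(image_targets)
    text_iter = iter(text_targets)
    total = 0.0
    for modality, vec in student_out:
        # Each position is supervised by the teacher for its own modality.
        target = next(image_iter) if modality == "image" else next(text_iter)
        total += cosine_distance(vec, target)
    return total / len(student_out)
```

Because every position is supervised by the teacher for its own modality, this form of the loss needs no image-text correspondence, which is consistent with the use of unpaired images and text tokens in this stage.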