Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VLmodels that redefines the conventional predetermined-resolution approach invisual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,which enables the model to dynamically process images of varying resolutionsinto different numbers of visual tokens. This approach allows the model togenerate more efficient and accurate visual representations, closely aligningwith human perceptual processes. The model also integrates Multimodal RotaryPosition Embedding (M-RoPE), facilitating the effective fusion of positionalinformation across text, images, and videos. We employ a unified paradigm forprocessing both images and videos, enhancing the model's visual perceptioncapabilities. To explore the potential of large multimodal models, Qwen2-VLinvestigates the scaling laws for large vision-language models (LVLMs). Byscaling both the model size-with versions at 2B, 8B, and 72B parameters-and theamount of training data, the Qwen2-VL Series achieves highly competitiveperformance. Notably, the Qwen2-VL-72B model achieves results comparable toleading models such as GPT-4o and Claude3.5-Sonnet across various multimodalbenchmarks, outperforming other generalist models. Code is available at\url{https://github.com/QwenLM/Qwen2-VL}.

Quick Read (beta)

loading the full paper ...