Abstract
We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks, including zero-shot captioning, visual and document visual question answering, and grounding. We demonstrate that Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.