InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Abstract

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of resolutions from 336 pixels to the 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the aspect ratios of the training images while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolution from 336 pixels to the 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
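To make the idea of dynamic resolution with automatic patch configuration more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical rule: given a patch budget, pick the widest grid of 336 x 336 patches whose row count follows the image aspect ratio and whose total patch count stays within the budget. The function names `patch_layout` and `target_size`, the `max_patches` parameter, and the budget of 55 used in the example are all illustrative assumptions; only the 336-pixel ViT input size comes from the abstract.

```python
VIT_SIZE = 336  # input side length of the pre-trained ViT, as stated in the abstract


def patch_layout(width: int, height: int, max_patches: int) -> tuple[int, int]:
    """Return a (cols, rows) grid of ViT-sized patches for an image.

    Assumed rule (hypothetical): use the largest number of columns such that
    rows = ceil(cols * height / width) keeps cols * rows <= max_patches.
    """
    if width <= 0 or height <= 0 or max_patches < 1:
        raise ValueError("width, height and max_patches must be positive")

    # Start with a single column; clamp rows so very tall images fit the budget.
    one_col_rows = -(-height // width)  # integer ceiling of height / width
    best = (1, min(max_patches, one_col_rows))

    for cols in range(2, max_patches + 1):
        rows = -(-(cols * height) // width)  # integer ceiling, no float rounding
        if cols * rows > max_patches:
            break  # cols * rows only grows with cols, so stop here
        best = (cols, rows)
    return best


def target_size(width: int, height: int, max_patches: int) -> tuple[int, int]:
    """Pixel size (W, H) the image would be resized or padded to under this rule."""
    cols, rows = patch_layout(width, height, max_patches)
    return cols * VIT_SIZE, rows * VIT_SIZE


if __name__ == "__main__":
    # A 3840 x 1600 input with an assumed budget of 55 patches.
    print(patch_layout(3840, 1600, 55))
    print(target_size(3840, 1600, 55))
    # A small square input with a budget of one patch stays at the base ViT size.
    print(target_size(336, 336, 1))
```

Under this assumed rule, the patch count and layout vary per image while the aspect ratio is approximately preserved, and a budget of one patch reduces to the plain 336 x 336 ViT input.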
