InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
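The Partial LoRA idea described above can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, in which a frozen weight W0 is augmented by a low-rank product B·A, and applies that update only to tokens flagged as image tokens. The function name `plora_linear`, the argument names, and the toy dimensions are hypothetical and are not taken from the released code; the abstract does not specify scaling factors or which layers are adapted, so none are modelled here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def plora_linear(
    tokens: List[Vector],
    is_image: List[bool],
    w0: Matrix,
    lora_a: Matrix,
    lora_b: Matrix,
) -> List[Vector]:
    """Frozen linear layer with a low-rank update added only for image tokens.

    tokens:   one input vector per token
    is_image: True where the token is an image token (assumed to be known)
    w0:       frozen pre-trained weight, shape (d_out, d_in)
    lora_a:   low-rank down-projection, shape (r, d_in)
    lora_b:   low-rank up-projection, shape (d_out, r)
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(w0, x)  # shared path: every token uses the frozen weight
        if img:
            # Image tokens additionally receive the low-rank correction B(Ax).
            delta = matvec(lora_b, matvec(lora_a, x))
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: the same input vector, once as a text token, once as an image token.
w0 = [[1.0, 0.0], [0.0, 1.0]]   # identity stands in for the pre-trained weight
lora_a = [[1.0, 1.0]]           # rank r = 1
lora_b = [[0.5], [0.0]]
out = plora_linear([[1.0, 2.0], [1.0, 2.0]], [False, True], w0, lora_a, lora_b)
print(out)
```

In this sketch the text token passes through the frozen weight unchanged, which is the sense in which the pre-trained language behaviour is preserved, while only the image token is shifted by the extra parameters.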
