InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

  • 2024-01-29 18:59:02
  • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

Abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
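
The abstract states only that PLoRA applies additional LoRA parameters exclusively to image tokens. As a rough illustration of that idea, the sketch below assumes the standard LoRA form (a frozen weight plus a low-rank update B·A) and a per-token boolean mask marking image tokens. The function name `plora_linear`, the mask, and the matrix shapes are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def plora_linear(
    tokens: List[Vector],     # one d_in-dimensional vector per token
    is_image: List[bool],     # assumed mask: True where the token is an image token
    w0: Matrix,               # frozen pre-trained weight, shape d_out x d_in
    lora_a: Matrix,           # low-rank down-projection, shape r x d_in
    lora_b: Matrix,           # low-rank up-projection, shape d_out x r
) -> List[Vector]:
    """Apply the frozen weight to every token; add the LoRA update only for image tokens."""
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(w0, x)  # shared path: text and image tokens both use w0
        if img:
            # Extra low-rank path, taken by image tokens only.
            delta = matvec(lora_b, matvec(lora_a, x))
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: the same input vector used once as an image token, once as a text token.
w0 = [[1.0, 0.0], [0.0, 1.0]]   # identity, standing in for the frozen weight
lora_a = [[1.0, 1.0]]           # rank r = 1
lora_b = [[0.5], [0.0]]
result = plora_linear([[1.0, 2.0], [1.0, 2.0]], [True, False], w0, lora_a, lora_b)
print(result)
```

In this toy run the text token passes through the frozen weight unchanged, while the image token additionally receives the low-rank update. That mirrors the abstract's stated motivation of leaving the pre-trained language path intact.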

 
