InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Abstract

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a title, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, and CCBench (Chinese Cultural Benchmark). Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer models with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
