Abstract
Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown rapidly in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning (ICL), most VLMs still struggle to understand complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing MMICL, a new approach that allows the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when the model is faced with extensive textual context.