MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Abstract

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown rapidly in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle to understand complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing the Vision-Language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive in-context learning (ICL) ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when VLMs are faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC
