MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

  • 2023-10-02 15:46:01
  • Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

Abstract

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle to understand complex multi-modal prompts with multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address the limitation above by 1) introducing MMICL, a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive in-context learning (ICL) ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when VLMs are faced with extensive textual context.

 
