True Multimodal In-Context Learning Needs Attention to the Visual Context

Abstract

Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior leaves MICL effectively unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
