Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties

Abstract

A major reason behind the recent success of large language models (LLMs) is their \textit{in-context learning} capability, which makes it possible to rapidly adapt them to downstream text-based tasks by prompting them with a small number of relevant demonstrations. While large vision-language models (VLMs) have recently been developed for tasks requiring both text and images, they largely lack in-context learning over visual information, especially in understanding and generating text about videos. In this work, we implement \textbf{E}mergent \textbf{I}n-context \textbf{Le}arning on \textbf{V}ideos (\eilev{}), a novel training paradigm that induces in-context learning over video and text by capturing key properties of pre-training data found by prior work to be essential for in-context learning in transformers. In our experiments, we show that \eilev{}-trained models outperform other off-the-shelf VLMs in few-shot video narration for novel, rare actions. Furthermore, we demonstrate that these key properties of bursty distributions, skewed marginal distributions, and dynamic meaning each contribute to varying degrees to VLMs' in-context learning capability in narrating procedural videos. Our results, analysis, and \eilev{}-trained models yield numerous insights about the emergence of in-context learning over video and text, creating a foundation for future work to optimize and scale VLMs for open-domain video understanding and reasoning. Our code and demo are available at \url{https://github.com/yukw777/EILEV}.
