LIVE: Learnable In-Context Vector for Visual Question Answering

Abstract

As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs substantially increases inference time, and 2) performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinatorial complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies have introduced non-learnable In-Context Vectors (ICVs), which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. However, although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose the Learnable In-Context VEctor (LIVE) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that LIVE can significantly reduce computational costs while enhancing accuracy in VQA tasks compared to traditional ICL and non-learnable ICV methods. The code is available at https://github.com/ForJadeForest/LIVE-Learnable-In-Context-Vector.
