Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning

Abstract

Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders to the input space of pre-trained language models as visual prompts, and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm remains inefficient because it significantly increases the input length of the language models. In this paper, in contrast to integrating visual prompts into inputs, we regard visual prompts as additional knowledge that helps language models address tasks associated with visual information. Motivated by the finding that the Feed-Forward Network (FFN) of language models acts as a "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of the FFN for visual knowledge injection. Experimental results across various VL tasks and language models reveal that MemVP significantly reduces the training time and inference latency of the fine-tuned VL models and surpasses the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP
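The idea of concatenating visual prompts with FFN weights can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a two-layer FFN with a ReLU activation and no biases, and it uses random matrices `vis_keys` and `vis_values` as stand-ins for projected visual features. All function names, variable names, and dimensions are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def ffn(x, w1, w2):
    """Plain two-layer FFN: columns of w1 act as keys, rows of w2 as values."""
    return relu(x @ w1) @ w2

def ffn_with_visual_memory(x, w1, w2, vis_keys, vis_values):
    """FFN whose weights are extended with visual key/value entries.

    Assumption: vis_keys and vis_values have shape (n_visual, d_model) and
    stand in for visual features already projected to the model dimension.
    """
    w1_ext = np.concatenate([w1, vis_keys.T], axis=1)   # (d_model, d_hidden + n_visual)
    w2_ext = np.concatenate([w2, vis_values], axis=0)   # (d_hidden + n_visual, d_model)
    return relu(x @ w1_ext) @ w2_ext

# Hypothetical sizes: 5 text tokens, model width 8, hidden width 32, 4 visual entries.
rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_visual = 5, 8, 32, 4
x = rng.normal(size=(n_tokens, d_model))
w1 = rng.normal(size=(d_model, d_hidden))
w2 = rng.normal(size=(d_hidden, d_model))
vis_keys = rng.normal(size=(n_visual, d_model))
vis_values = rng.normal(size=(n_visual, d_model))

out = ffn_with_visual_memory(x, w1, w2, vis_keys, vis_values)
```

In this sketch the token sequence `x` keeps its original length, since the visual information enters through the extended weight matrices rather than through extra input tokens. Because the activation is element-wise, the result equals the plain FFN output plus a separate visual-memory term, `relu(x @ vis_keys.T) @ vis_values`.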
