Multitask Vision-Language Prompt Tuning

Abstract

Prompt tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show that many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and unified vision-language prompt tuning. Results on 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting a new state of the art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and that many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
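
To make point (i) of the abstract concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes each task's loss can be stood in for by a squared distance to a task-specific optimum (in the real setting the loss would come from a frozen vision-language model). The names `loss`, `grad`, `tune`, `source_optima`, and `target_optimum` are hypothetical and chosen for illustration only. The sketch learns one shared prompt on several source tasks, then compares a few adaptation steps on a target task starting from that shared prompt against the same number of steps starting from zeros.

```python
# Toy sketch only: quadratic losses stand in for the task loss of a frozen
# vision-language model. All names and numbers here are illustrative assumptions.
from typing import List

Vector = List[float]


def loss(prompt: Vector, optimum: Vector) -> float:
    """Hypothetical task loss: squared distance to a task-specific optimum."""
    return sum((p - o) ** 2 for p, o in zip(prompt, optimum))


def grad(prompt: Vector, optimum: Vector) -> Vector:
    """Gradient of the toy loss with respect to the prompt vector."""
    return [2.0 * (p - o) for p, o in zip(prompt, optimum)]


def tune(prompt: Vector, optima: List[Vector], steps: int, lr: float = 0.1) -> Vector:
    """Gradient descent on the mean loss over the tasks listed in `optima`.

    With several tasks this is joint (multitask) tuning of one shared prompt;
    with a single task it is ordinary single-task prompt tuning.
    """
    prompt = list(prompt)
    for _ in range(steps):
        grads = [grad(prompt, o) for o in optima]
        mean_grad = [sum(g[i] for g in grads) / len(grads) for i in range(len(prompt))]
        prompt = [p - lr * g for p, g in zip(prompt, mean_grad)]
    return prompt


# Three made-up source tasks and one made-up, related target task.
source_optima = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
target_optimum = [1.1, 2.1]
zeros = [0.0, 0.0]

# Step 1: learn a single shared prompt jointly on all source tasks.
shared_prompt = tune(zeros, source_optima, steps=50)

# Step 2: adapt to the target task with only a few update steps,
# once from scratch and once from the shared initialization.
few_steps = 3
from_scratch = tune(zeros, [target_optimum], steps=few_steps)
from_shared = tune(shared_prompt, [target_optimum], steps=few_steps)

print("target loss, from scratch:", loss(from_scratch, target_optimum))
print("target loss, from shared init:", loss(from_shared, target_optimum))
```

In this toy setup the shared initialization ends with a lower target loss after the same small number of steps, because the target optimum was chosen to lie near the source optima. Whether real tasks benefit in this way depends on how related they are, which is what the transferability study described in the abstract examines.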
