Abstract
Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
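To make the idea of an instruction-conditioned mixture of visual projectors concrete, the following is a minimal sketch, not the method as implemented. The abstract does not specify the gating function, the projector architecture, or the pruning criterion, so everything here is an assumption made for illustration: linear projectors, a softmax router over an instruction embedding, and pruning modeled as simply masking an expert out of the router. All names (`MixtureOfProjectors`, `gate`, `project`, `prune`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; entries of -inf receive weight 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class MixtureOfProjectors:
    """Illustrative only: linear experts gated by an instruction embedding."""

    def __init__(self, num_experts, vis_dim, txt_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.out_dim = out_dim
        # One linear visual-to-language projector per expert (assumption).
        self.experts = [
            [[rng.gauss(0.0, 0.1) for _ in range(vis_dim)] for _ in range(out_dim)]
            for _ in range(num_experts)
        ]
        # Router maps the instruction embedding to one logit per expert.
        self.router = [
            [rng.gauss(0.0, 0.1) for _ in range(txt_dim)] for _ in range(num_experts)
        ]
        self.active = [True] * num_experts

    def gate(self, instr_emb):
        """Return expert weights that depend only on the instruction."""
        logits = matvec(self.router, instr_emb)
        masked = [l if a else -math.inf for l, a in zip(logits, self.active)]
        return softmax(masked)

    def project(self, vis_feat, instr_emb):
        """Mix the experts' projections of the visual feature."""
        out = [0.0] * self.out_dim
        for weight, expert in zip(self.gate(instr_emb), self.experts):
            if weight == 0.0:
                continue  # pruned experts contribute nothing
            projected = matvec(expert, vis_feat)
            out = [o + weight * p for o, p in zip(out, projected)]
        return out

    def prune(self, expert_idx):
        """Stand-in for expert pruning: mask one expert out of the router."""
        if self.active[expert_idx] and sum(self.active) == 1:
            raise ValueError("cannot prune the last active expert")
        self.active[expert_idx] = False


# Usage with toy dimensions and made-up feature values.
moe = MixtureOfProjectors(num_experts=3, vis_dim=4, txt_dim=5, out_dim=2)
instruction = [0.1, -0.2, 0.3, 0.0, 0.5]
visual = [1.0, 2.0, 3.0, 4.0]
language_space_feature = moe.project(visual, instruction)
```

The design point this sketch tries to capture is that the routing weights are computed from the instruction alone, so the same image can be translated differently under different instructions. How experts are recommended for reuse across similar tasks, and which experts are pruned, is not modeled here.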