HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Abstract

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model whose parameters are static), which shares the same parameters across all inputs, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
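
To make the idea of an input-conditioned parameter shift concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a single linear projector and a hypothetical hypernetwork that maps a guidance vector to a rank-1 weight shift; the class name DynamicProjector, the rank-1 form, and all dimensions are illustrative assumptions that do not come from the abstract.

```python
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; stands in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicProjector:
    """Static linear projector plus a guidance-conditioned weight shift.

    Hypothetical illustration: the rank-1 shift is an assumption made
    here for brevity, not a detail taken from the paper.
    """

    def __init__(self, d_in, d_out, d_guide, seed=0):
        rng = random.Random(seed)
        # Shared weights: identical for every input (the "static" part).
        self.static_w = random_matrix(d_out, d_in, rng)
        # Hypernetwork heads: guidance -> the two factors of the shift.
        self.head_out = random_matrix(d_out, d_guide, rng)
        self.head_in = random_matrix(d_in, d_guide, rng)

    def weight_shift(self, guidance):
        """Return a (d_out x d_in) shift generated from the guidance vector."""
        u = matvec(self.head_out, guidance)  # length d_out
        v = matvec(self.head_in, guidance)   # length d_in
        return [[ui * vj for vj in v] for ui in u]  # outer product u v^T

    def forward(self, feature, guidance):
        """Project a feature with weights adapted to the given guidance."""
        delta = self.weight_shift(guidance)
        dynamic_w = [
            [w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.static_w, delta)
        ]
        return matvec(dynamic_w, feature)


proj = DynamicProjector(d_in=4, d_out=3, d_guide=2)
feature = [1.0, 0.5, -0.5, 2.0]
out_a = proj.forward(feature, guidance=[1.0, 0.0])
out_b = proj.forward(feature, guidance=[0.0, 1.0])
out_static = proj.forward(feature, guidance=[0.0, 0.0])
```

Here the same feature is projected differently under two guidance vectors, while a zero guidance vector yields a zero shift and recovers the static projector. A real system would learn the hypernetwork heads jointly with the rest of the model and would derive the guidance from the visual or language input.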
