HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

  • 2024-03-20 10:42:43
  • Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang

Abstract

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model with static parameters), which shares the same parameters across all inputs, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
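The contrast between a static mapper and a hypernetwork-generated parameter shift can be illustrated with a small NumPy sketch. This is not the authors' implementation: the dimensions, the low-rank form of the shift, the mean-pooled guidance vector, and the names `parameter_shift` and `dynamic_project` are all assumptions chosen for illustration. It only shows the general idea that the projector weights become a function of the input instead of being shared unchanged across all inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
D_VIS, D_TXT, RANK = 16, 32, 4

# Static projector: one weight matrix shared by every input.
W_static = rng.normal(scale=0.02, size=(D_VIS, D_TXT))

# Assumed hypernetwork: two linear maps that turn a guidance vector
# into the factors of a low-rank weight shift.
H_a = rng.normal(scale=0.02, size=(D_VIS, D_VIS * RANK))
H_b = rng.normal(scale=0.02, size=(D_VIS, RANK * D_TXT))


def parameter_shift(guidance):
    """Generate an input-dependent low-rank shift for the projector weights."""
    a = np.tanh(guidance @ H_a).reshape(D_VIS, RANK)
    b = np.tanh(guidance @ H_b).reshape(RANK, D_TXT)
    return a @ b


def dynamic_project(visual_feats):
    """Project visual features with weights adapted to this particular input."""
    # Assumption: the guidance is the mean-pooled visual feature vector.
    guidance = visual_feats.mean(axis=0)
    W_dynamic = W_static + parameter_shift(guidance)
    return visual_feats @ W_dynamic


# Two stand-in "images", each a sequence of 9 patch features.
img_a = rng.normal(size=(9, D_VIS))
img_b = rng.normal(size=(9, D_VIS))
tokens_a = dynamic_project(img_a)
tokens_b = dynamic_project(img_b)
```

In this sketch each input receives its own effective projector, while a static mapper would apply `W_static` to both. The low-rank factorization is one common way to keep the generated shift cheap; whether HyperLLaVA uses this exact form is not stated in the abstract.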

 
