LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Abstract

The success of Large Language Models (LLMs) has led researchers to explore Multimodal Large Language Models (MLLMs) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLMs limit their use in resource-constrained environments. A small-scale MLLM (s-MLLM) aims to retain the capabilities of its large-scale counterpart (l-MLLM) while reducing computational demands, but this results in a significant decline in performance. To address this issue, we propose a novel LLaVA-KD framework to transfer knowledge from the l-MLLM to the s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of the l-MLLM and the s-MLLM, and Relation Distillation (RDist) to transfer the l-MLLM's ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of the s-MLLM: 1) Distilled Pre-Training to align visual-textual representations, 2) Supervised Fine-Tuning to equip the model with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer l-MLLM capabilities. Our approach significantly improves performance without altering the small model's architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available at https://github.com/caiyuxuan1120/LLaVA-KD.
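
The two distillation objectives named above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which divergence or relation measure is used, so the sketch assumes KL divergence over per-token output distributions for the MDist-style term and a mean squared difference between pairwise cosine-similarity matrices of visual token features for the RDist-style term. All function names, the `temperature` parameter, and the toy inputs are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def output_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed MDist-style term: mean KL(teacher || student) over token positions."""
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between token feature vectors (N x N)."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    unit = [[x / n for x in f] for f, n in zip(features, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def relation_distillation_loss(teacher_feats, student_feats):
    """Assumed RDist-style term: MSE between teacher and student similarity matrices."""
    rt = cosine_similarity_matrix(teacher_feats)
    rs = cosine_similarity_matrix(student_feats)
    n = len(rt)
    return sum((rt[i][j] - rs[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


if __name__ == "__main__":
    # Toy logits: 2 token positions, vocabulary of 3 (illustrative values only).
    teacher_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    student_logits = [[0.0, 0.0, 0.0], [0.3, 0.2, 0.1]]
    print(output_distillation_loss(teacher_logits, student_logits))

    # Toy visual features: 2 tokens; teacher and student widths differ.
    teacher_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    student_feats = [[2.0, 0.0], [0.0, 3.0]]
    print(relation_distillation_loss(teacher_feats, student_feats))
```

Because the relation term in this sketch compares N x N similarity matrices rather than raw features, the teacher and student hidden sizes do not need to match, which is one way a relation-style objective can leave the small model's architecture unchanged.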
