GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Abstract

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ token types that differ in vocabulary size, token splits, and token index ordering. To remove this restriction of distillation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, eventually outperforming large-scale open- and closed-source VLMs.
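The token mismatch described above can be made concrete with a toy sketch. It is not part of GenRecal and does not implement the Recalibrator; the two vocabularies, the greedy tokenizer, and the function names are all hypothetical. It only illustrates why position-wise matching of teacher and student output distributions presupposes a shared vocabulary size, identical token splits, and identical index ordering.

```python
# Toy vocabularies (hypothetical; not taken from any real tokenizer).
VOCAB_A = ["<unk>", "a", "cat", "sat", " "]      # word-level pieces
VOCAB_B = ["<unk>", " ", "s", "at", "c", "a"]    # subword pieces, other size and order


def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization; returns a list of vocabulary indices."""
    pieces = sorted((p for p in vocab if p != "<unk>"), key=len, reverse=True)
    ids, i = [], 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                ids.append(vocab.index(piece))
                i += len(piece)
                break
        else:
            # No piece matched: emit <unk> and skip one character.
            ids.append(vocab.index("<unk>"))
            i += 1
    return ids


def tokenwise_matching_possible(text, vocab_teacher, vocab_student):
    """True only if a per-position comparison of token distributions is well defined."""
    same_vocab = vocab_teacher == vocab_student  # same size and same index ordering
    same_split = len(greedy_tokenize(text, vocab_teacher)) == len(
        greedy_tokenize(text, vocab_student)
    )
    return same_vocab and same_split


if __name__ == "__main__":
    sentence = "a cat sat"
    print(greedy_tokenize(sentence, VOCAB_A))
    print(greedy_tokenize(sentence, VOCAB_B))
    print(tokenwise_matching_possible(sentence, VOCAB_A, VOCAB_B))
```

In this toy setting the same sentence yields sequences of different lengths over index spaces of different sizes, so there is no one-to-one correspondence between teacher and student positions. That is the situation in which, per the abstract, alignment at the level of feature representations is used instead.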
