FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training

Abstract

Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. Code is available at https://github.com/MIV-XJTU/FLAME.
