Abstract
CLIP is one of the most important multimodal foundation models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancement of large language models (LLMs) such as GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text and therefore possess open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process in which the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by the limited context window and capability of vanilla CLIP's text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
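To make the contrastive learning mentioned above concrete, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the exact objective used in this work: the function names, the temperature value of 0.07, and the choice of pairs (for example, two captions of the same image, or an image and its caption) are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical helper, for illustration).

    a, b: arrays of shape (batch, dim); row i of `a` is assumed to be the
    positive match of row i of `b`, and every other row acts as a negative.
    temperature: assumed value, not taken from the paper.
    """
    a = l2_normalize(a)
    b = l2_normalize(b)
    logits = a @ b.T / temperature            # (batch, batch) similarities
    idx = np.arange(len(a))
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)
```

Under these assumptions, the loss is small when matching rows are the most similar pair in the batch and grows when the pairing is scrambled, which is the sense in which such training can sharpen the discriminability of output embeddings.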