Abstract
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of its text input. The text input is restricted to 77 tokens, and an empirical study shows that the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications in image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns with the CLIP latent space, so that it can readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
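As a rough illustration of what a knowledge-preserved stretching of the positional embedding could look like, the sketch below keeps the first positions of a positional embedding table untouched and linearly interpolates the remaining ones to a finer grid. The abstract does not specify the procedure, so the function name `stretch_positional_embedding`, the parameters `keep` and `ratio`, their default values, and the interpolation rule itself are all assumptions made for this example. Only the figures 77 (token limit) and 20 (approximate effective length) come from the text.

```python
def stretch_positional_embedding(pos_emb, keep=20, ratio=4):
    """Lengthen a positional embedding table (hypothetical sketch).

    pos_emb: list of embedding vectors (lists of floats), one per position.
    keep:    number of leading positions copied unchanged (assumed value).
    ratio:   integer stretch factor for the remaining positions (assumed value).
    """
    n = len(pos_emb)
    if not 0 <= keep <= n:
        raise ValueError("keep must lie between 0 and len(pos_emb)")
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")

    # Leading positions are copied as-is so their learned values are preserved.
    stretched = [list(row) for row in pos_emb[:keep]]

    # Each remaining source position is expanded into `ratio` new positions
    # by linear interpolation between neighbouring source embeddings.
    for j in range((n - keep) * ratio):
        src = keep + j / ratio       # fractional index into the original table
        lo = int(src)                # left neighbour
        hi = min(lo + 1, n - 1)      # right neighbour, clamped at the last row
        frac = src - lo
        stretched.append(
            [(1.0 - frac) * a + frac * b for a, b in zip(pos_emb[lo], pos_emb[hi])]
        )
    return stretched


# Toy 77-position table with 2-dimensional embeddings, for illustration only.
toy_table = [[float(i), 2.0 * i] for i in range(77)]
longer_table = stretch_positional_embedding(toy_table)
print(len(longer_table))  # 20 + (77 - 20) * 4 = 248
```

Under these assumed settings, the first 20 rows of the result are identical to the original table, which is the property the word "knowledge-preserved" suggests, while the input capacity grows from 77 to 248 positions.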