jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform in single-mode text tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets and image-text pairs via a multi-task and multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also provides for flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
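
To illustrate what selecting the embedding dimensionality can look like in practice, the following is a minimal sketch and not the model's actual API. It assumes that a smaller representation is obtained by keeping a prefix of the full vector and re-normalizing it, which the abstract does not state. The helper names truncate_and_normalize and cosine_similarity are hypothetical, and the 8-dimensional vectors are made-up stand-ins for a text embedding and an image embedding.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length.

    Assumption: a lower-dimensional representation is a prefix of the
    full embedding. This is an illustrative choice, not a documented
    property taken from the abstract.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for a text and an image embedding.
text_vec = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
image_vec = [0.8, 0.2, 0.25, -0.1, -0.05, 0.1, 0.0, -0.03]

# Compare the crossmodal similarity at full and at reduced dimensionality.
full_score = cosine_similarity(text_vec, image_vec)
small_score = cosine_similarity(
    truncate_and_normalize(text_vec, 4),
    truncate_and_normalize(image_vec, 4),
)
print(f"similarity with 8 dims: {full_score:.4f}")
print(f"similarity with 4 dims: {small_score:.4f}")
```

Under this assumption, a shorter prefix reduces storage and comparison cost, and the two printed scores show how much the similarity shifts for this toy pair when half of the components are dropped.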
