Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

Abstract

Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.
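
To make the cost of the splitting strategy concrete, the following minimal sketch counts how many vectors a single document produces under a 512-token window versus an 8192-token window. It assumes non-overlapping chunks and one embedding per chunk; the helper `num_chunks` and the example document length are hypothetical and are not taken from the paper.

```python
import math


def num_chunks(doc_tokens: int, window: int) -> int:
    """Number of non-overlapping chunks needed to cover a document.

    Assumption: one embedding vector is stored per chunk.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if doc_tokens <= 0:
        return 0
    return math.ceil(doc_tokens / window)


# Hypothetical document length, chosen to match the 8192-token limit.
doc_tokens = 8192

for window in (512, 8192):
    print(f"window={window}: {num_chunks(doc_tokens, window)} vector(s)")
```

Under these assumptions, a document of 8192 tokens needs 16 vectors with a 512-token window but only one with an 8192-token window, so index size and search work grow in proportion to the number of chunks.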
