KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

Abstract

As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.
