F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

  • 2025-10-02 17:58:49
  • Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Abstract

We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
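
The abstract does not state the training objective, so the following is only an illustrative sketch of how a query-document-negative tuple is commonly turned into a contrastive loss. It assumes an InfoNCE-style objective over cosine similarities; the function name `info_nce_loss`, the temperature value, and the toy vectors are hypothetical and are not taken from the F2LLM code release.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one (query, positive document, negatives) tuple.

    Assumption: this objective and the temperature are illustrative choices,
    not details confirmed by the abstract.
    """
    q = normalize(query)
    # The positive document is placed at index 0, followed by the negatives.
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - logits[0]


# Toy embeddings: the loss is lower when the query is close to the positive.
aligned = info_nce_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

In this sketch the loss shrinks as the query embedding moves toward its paired document and away from the negatives, which is the behavior a finetuning run on such tuples would optimize for.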

 
