CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora

Abstract

Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside their pre-training distribution. To address this, we introduce CustomIR, a framework for unsupervised adaptation of pre-trained language embedding models to domain-specific corpora using synthetically generated query-document pairs. CustomIR leverages large language models (LLMs) to create diverse queries grounded in a known target corpus, paired with LLM-verified hard negatives, eliminating the need for costly human annotation. Experiments on enterprise email and messaging datasets show that CustomIR consistently improves retrieval effectiveness, with small models gaining up to 2.3 points in Recall@10. This gain lets these small models rival much larger alternatives, enabling cheaper RAG deployments. These results highlight that targeted synthetic fine-tuning offers a scalable and cost-efficient strategy for increasing domain-specific performance.
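For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@10 is commonly computed. The abstract does not describe the exact evaluation protocol, so the function name `recall_at_k`, the dictionary-based inputs, and the toy document IDs are all illustrative assumptions rather than the authors' code.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean over queries of |top-k retrieved AND relevant| / |relevant|.

    rankings: query id -> document ids ordered by descending similarity.
    relevant: query id -> set of document ids judged relevant.
    Both structures are assumptions made for this sketch.
    """
    scores = []
    for query_id, relevant_docs in relevant.items():
        if not relevant_docs:
            continue  # skip queries with no labeled relevant document
        top_k = rankings.get(query_id, [])[:k]
        hits = len(relevant_docs.intersection(top_k))
        scores.append(hits / len(relevant_docs))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, one relevant document each.
rankings = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
}
relevant = {"q1": {"d1"}, "q2": {"d5"}}

# Use k=2 only because the toy rankings are short; the paper reports k=10.
score = recall_at_k(rankings, relevant, k=2)
```

When each synthetic query is paired with a single source document, this reduces to the fraction of queries whose source document appears in the top k results.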
