DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers

Abstract

Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter size brings significant inference-time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These results highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
