Abstract
Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they also add noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from the training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.