Abstract
Generative retrieval has emerged as a novel paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers. Although promising, the mechanisms that underpin its performance and scalability remain largely unclear. We conduct a systematic investigation of training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence retrieval performance. To address the lack of suitable metrics, we propose a novel evaluation measure inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws, especially when paired with larger LLMs. Furthermore, increasing inference computation yields substantial performance gains, revealing that generative retrieval can significantly benefit from higher compute budgets at inference. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Taken together, our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.