Abstract
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.