Abstract
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.