ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Abstract

Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES accurately evaluates RAG systems while using only a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our code and datasets publicly available on GitHub.
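
To illustrate the PPI step mentioned above, the following is a minimal sketch of the standard prediction-powered estimate of a mean: the judge's average score on a large unlabeled set is corrected by the average judge error measured on a small human-annotated set. This is not the ARES implementation. The function name `ppi_mean_interval`, its parameters, and the toy scores are hypothetical, and the sketch assumes binary (0/1) judgments and a normal approximation for the confidence interval.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_interval(judge_unlabeled, judge_labeled, human_labeled, alpha=0.05):
    """Prediction-powered estimate of a mean score with a (1 - alpha) interval.

    judge_unlabeled: judge scores on the large set with no human labels
    judge_labeled:   judge scores on the small human-annotated set
    human_labeled:   human scores on that same small set (same order)
    All names here are illustrative assumptions, not part of ARES.
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge and human scores must be paired")
    if len(human_labeled) < 2 or len(judge_unlabeled) < 2:
        raise ValueError("need at least two points in each set")

    n_labeled = len(human_labeled)
    n_unlabeled = len(judge_unlabeled)

    # Rectifier: how far the judge is from the human label on each annotated point.
    rectifier = [h - j for h, j in zip(human_labeled, judge_labeled)]

    # Judge average on unlabeled data, shifted by the average judge error.
    estimate = mean(judge_unlabeled) + mean(rectifier)

    # Both terms contribute sampling noise to the estimate.
    std_err = math.sqrt(
        variance(judge_unlabeled) / n_unlabeled + variance(rectifier) / n_labeled
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Toy data: 1 = judged relevant/faithful, 0 = not.
    judge_unlabeled = [1, 1, 0, 1, 0, 1, 1, 0]
    judge_labeled = [1, 0, 1, 1]
    human_labeled = [1, 0, 0, 1]
    print(ppi_mean_interval(judge_unlabeled, judge_labeled, human_labeled))
```

The interval narrows as the judge agrees more closely with the human labels, which is why a few hundred annotations can be enough when the judge is reasonably accurate.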
