The paper introduces ARES, an Automated RAG Evaluation System for retrieval-augmented generation (RAG) systems. ARES aims to evaluate such systems quickly and accurately without relying heavily on human annotations.
The key highlights of the paper are:
ARES generates its own synthetic training data by using language models to create question-answer pairs from a corpus of in-domain passages. It then fine-tunes lightweight LLM judges on this data to assess individual RAG components along three dimensions: context relevance, answer faithfulness, and answer relevance.
To correct for the judges' prediction errors, ARES combines their predictions with a small set of human-annotated datapoints through prediction-powered inference (PPI), which yields statistical confidence intervals for the RAG system's performance.
ARES is evaluated across eight different knowledge-intensive tasks from KILT, SuperGLUE, and AIS. The results show that ARES can accurately evaluate RAG systems while using only a few hundred human annotations during evaluation, outperforming existing automated evaluation approaches.
ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems.
The paper also explores the importance of human annotations for ARES, finding that a minimum of 150 annotated datapoints is required for the human preference validation set.
Overall, ARES provides a novel and efficient approach for automatically evaluating RAG systems, reducing the need for extensive human annotations while maintaining high accuracy.
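The PPI step mentioned above can be illustrated with a minimal sketch. It implements the textbook PPI estimator for a mean, not necessarily the exact procedure used in ARES: the judge's average score on the unlabeled data is shifted by the average judge error measured on the human-annotated set, and the two sources of variance are combined into a normal-approximation confidence interval. The function name `ppi_mean_ci`, its parameters, and the use of binary 0/1 labels are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled, alpha=0.05):
    """Prediction-powered estimate of a mean score with a (1 - alpha) CI.

    judge_unlabeled: judge scores on datapoints without human labels
    judge_labeled:   judge scores on the human-annotated datapoints
    human_labeled:   human labels for those same datapoints (same order)
    All names here are hypothetical; scores are assumed to be 0/1 or real-valued.
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge_labeled and human_labeled must align")
    n, big_n = len(human_labeled), len(judge_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two datapoints in each set")

    # Rectifier: how far the judge is from the human label on annotated data.
    rectifier = [h - j for h, j in zip(human_labeled, judge_labeled)]

    # Judge average on the large set, corrected by the average judge error.
    estimate = mean(judge_unlabeled) + mean(rectifier)

    # Variances of the two independent sample means add.
    std_err = sqrt(variance(judge_unlabeled) / big_n + variance(rectifier) / n)

    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Toy data: the judge is slightly too generous on the annotated set.
    judge_unlabeled = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
    judge_labeled = [1, 0, 1, 1]
    human_labeled = [1, 0, 0, 1]
    est, (lo, hi) = ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled)
    print(f"estimate={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The interval narrows as the judge agrees more closely with the human labels (smaller rectifier variance) or as more annotated datapoints are added, which is consistent with the summary's point that only a modest number of human annotations is needed.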