Abstract
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
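
As an illustration of the definition above, the score for a single generation can be sketched as the fraction of its atomic facts that a knowledge source supports. The snippet below is a minimal sketch under stated assumptions, not the interface of the `factscore` package: the function name `fact_score`, the `is_supported` callback, and the toy knowledge source (a set of exact-match strings) are all hypothetical, and the atomic facts are assumed to have been extracted already.

```python
from typing import Callable, Iterable


def fact_score(atomic_facts: Iterable[str],
               is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts judged as supported.

    `is_supported` is a hypothetical stand-in for a human annotator or an
    automated verifier backed by a knowledge source.
    """
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("at least one atomic fact is required")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Toy knowledge source (assumption): a set of statements treated as true.
knowledge = {
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
}

# Atomic facts assumed to come from one generated biography.
facts = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
    "Marie Curie was born in 1900.",
    "Marie Curie was a painter.",
]

score = fact_score(facts, lambda fact: fact in knowledge)
print(f"{score:.0%}")  # prints "50%"
```

Two of the four facts are supported here, so a binary judgment of the whole biography would lose information that the fractional score keeps.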