VeriFastScore: Speeding up long-form factuality evaluation

Abstract

Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
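
To make the distinction between example-level and system-level correlation concrete, the following is a minimal sketch, not the evaluation code behind the reported numbers. It assumes that r denotes Pearson's correlation coefficient, that example-level correlation pools per-response scores across systems, and that system-level correlation compares per-system mean scores. The function names, system names, and scores are hypothetical toy values chosen for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(vx * vy)


def example_level(scores_a, scores_b):
    """Correlate the two metrics over every individual response (assumed pooling)."""
    names = sorted(scores_a)
    xs = [s for name in names for s in scores_a[name]]
    ys = [s for name in names for s in scores_b[name]]
    return pearson(xs, ys)


def system_level(scores_a, scores_b):
    """Correlate the two metrics over per-system mean scores."""
    names = sorted(scores_a)
    return pearson(
        [mean(scores_a[name]) for name in names],
        [mean(scores_b[name]) for name in names],
    )


# Hypothetical per-response scores from a slow reference metric and a fast metric,
# keyed by the (made-up) system that generated each response.
reference_scores = {"sysA": [0.2, 0.4], "sysB": [0.6, 0.8], "sysC": [0.9, 0.7]}
fast_scores = {"sysA": [0.3, 0.3], "sysB": [0.5, 0.9], "sysC": [0.8, 0.8]}

if __name__ == "__main__":
    print("example-level r:", round(example_level(reference_scores, fast_scores), 3))
    print("system-level r:", round(system_level(reference_scores, fast_scores), 3))
```

In this toy data the per-system means of the two metrics coincide, so the system-level correlation is higher than the example-level one. Averaging over responses cancels per-response disagreement, which is one reason a system-level figure can exceed an example-level figure.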
