xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Abstract

With the release of the o1 model by OpenAI, reasoning models adopting slow-thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
