CoverBench: A Challenging Benchmark for Complex Claim Verification

  • 2024-08-06 18:58:53
  • Alon Jacovi, Moran Ambar, Eyal Ben-David, Uri Shaham, Amir Feder, Mor Geva, Dror Marcus, Avi Caciularu

Abstract

There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom. The data is available at https://huggingface.co/datasets/google/coverbench .

 
