Abstract
There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling and selection of hard examples to assemble such a benchmark. CoverBench provides a diversified evaluation for complex claim verification across a variety of domains and types of reasoning, with relatively long inputs and a variety of standardizations, such as multiple representations for tables where available and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show that CoverBench is challenging and has very significant headroom. The data is available at https://huggingface.co/datasets/google/coverbench.