Evaluating language models as risk scores

Abstract

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using language models, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We demonstrate the utility of folktexts through a sweep of empirical insights into the statistical properties of 17 recent large language models across five natural text benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes answer distribution regardless of true underlying data uncertainty. Conversely, verbally querying models for probability estimates results in substantially improved calibration across all instruction-tuned models. These differences in ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind-spot in the current evaluation ecosystem that folktexts covers.
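
To make the notions of a risk score and miscalibration concrete, the following is a minimal sketch and not the folktexts API. It assumes that a multiple-choice risk score is obtained by renormalizing the probability mass a model places on the two answer tokens, and it measures calibration with a standard equal-width binned expected calibration error. The function names `risk_score_from_choice_probs` and `expected_calibration_error` are hypothetical and chosen for illustration only.

```python
def risk_score_from_choice_probs(p_positive, p_negative):
    """Renormalize the mass on the two answer tokens into a score in [0, 1].

    Assumption: the model exposes a probability for each answer choice;
    mass placed on any other token is discarded.
    """
    total = p_positive + p_negative
    if total <= 0:
        raise ValueError("answer tokens received no probability mass")
    return p_positive / total


def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE: weighted mean gap between mean score and observed rate."""
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")

    # Group (score, label) pairs into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    n_total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n_total) * abs(mean_score - positive_rate)
    return ece


# Illustrative, made-up values: polarized scores on an outcome that is
# positive only half the time in each group are heavily miscalibrated.
polarized = expected_calibration_error([1.0, 1.0, 0.0, 0.0], [1, 0, 0, 1])
```

Under this sketch, scores that are pushed toward 0 or 1 on an inherently uncertain outcome yield a large calibration error even when they rank individuals well, which is the kind of gap an accuracy-only benchmark would not surface.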
