Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations

Abstract

Language model users often issue queries that lack specification, where the context under which a query was issued -- such as the user's identity, the query's intent, and the criteria for a response to be useful -- is not explicit. For instance, a good response to a subjective query like "What book should I read next?" would depend on the user's preferences, and a good response to an open-ended query like "How do antibiotics work against bacteria?" would depend on the user's expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts. Specifically, our procedure uncovers an implicit bias towards WEIRD contexts in models' "default" responses and we find that models are not equally sensitive to following different contexts, even when they are provided in prompts.
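As a rough illustration of the protocol described above, the following Python sketch shows how an evaluation prompt might be assembled with and without context, and how one could check whether the preferred model in a pair flips between the two settings. Everything here is an assumption made for illustration: the abstract does not specify how context is represented, so the (follow-up question, answer) pair format, the function names, and the 0.5 win-rate threshold are all hypothetical and not the authors' implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: context as (follow-up question, answer) pairs.
Context = List[Tuple[str, str]]


def build_eval_prompt(query: str, context: Optional[Context] = None) -> str:
    """Format the text shown to an evaluator, with or without context."""
    lines = [f"Query: {query}"]
    if context:
        lines.append("Context:")
        lines.extend(f"- {question} {answer}" for question, answer in context)
    return "\n".join(lines)


def win_rate(winners: List[str], model: str) -> float:
    """Fraction of non-tie pairwise judgments won by `model`."""
    decided = [w for w in winners if w != "tie"]
    if not decided:
        raise ValueError("no decided judgments")
    return sum(w == model for w in decided) / len(decided)


def preference_flips(no_ctx: List[str], with_ctx: List[str], model: str = "A") -> bool:
    """True if `model` is preferred (win rate > 0.5) in exactly one setting."""
    return (win_rate(no_ctx, model) > 0.5) != (win_rate(with_ctx, model) > 0.5)
```

For example, `build_eval_prompt("How do antibiotics work against bacteria?", [("What is your level of expertise?", "Novice")])` yields a prompt that states the user's expertise alongside the query, so an evaluator no longer has to guess it. Given per-query winner labels collected in each setting, `preference_flips` reports whether the conclusion about the model pair changes once context is supplied.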
