Holistic Evaluation of Language Models

  • 2022-11-16 18:51:34
  • Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda

Abstract

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

 
