Scaling Language Models: Methods, Analysis & Insights from Training Gopher

  • 2022-01-21 18:39:38
  • Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz,

Abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

 
