It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Abstract

When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.
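
As a rough illustration of what converting a textual input into a cloze question with a task description could look like, here is a minimal sketch. The pattern wording, the label words, and the `score_token` callback (a stand-in for a masked language model's score at the masked position) are all assumptions made for illustration; none of them is taken from the abstract.

```python
# Illustrative sketch only: every name here is hypothetical, not the paper's code.
MASK = "[MASK]"

# Verbalizer: maps each task label to one word that could fill the masked slot.
VERBALIZER = {"positive": "great", "negative": "terrible"}


def to_cloze(review: str) -> str:
    """Wrap the raw input in a short task description with one masked slot."""
    return f"{review} All in all, it was {MASK}."


def predict(review: str, score_token) -> str:
    """Return the label whose verbalizer word scores highest for the slot.

    `score_token(cloze_text, word)` is an assumed callback standing in for a
    masked language model's score of `word` at the mask position.
    """
    cloze = to_cloze(review)
    return max(VERBALIZER, key=lambda label: score_token(cloze, VERBALIZER[label]))
```

In this sketch the classification task is reduced to asking which word best fills the blank, so a model could in principle be tuned on a handful of labeled examples with ordinary gradient-based optimization of its scores for those words.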
