Language Model Evaluation Beyond Perplexity

Abstract

We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of language models to certain statistical tendencies of natural language. We find that neural language models appear to learn only a subset of the statistical tendencies considered, but align much more closely with empirical trends than theoretical laws (when present). Further, the fit to different distributions depends on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type-token relationship of natural language than text produced using standard ancestral sampling; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.
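To make the kind of comparison described above concrete, the following is a minimal illustrative sketch, not the authors' implementation. It computes a type-token curve (distinct types seen after each token) and runs a generic two-sided permutation test on a per-sample statistic such as sentence length. The function names `type_token_curve` and `permutation_test`, the choice of a difference-in-means statistic, and the toy data are all assumptions introduced for illustration; the abstract does not specify which significance tests are used.

```python
import random
from statistics import mean


def type_token_curve(tokens):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def permutation_test(xs, ys, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means.

    Assumption: the statistic (difference in means) is chosen only for
    illustration; any other corpus statistic could be substituted.
    """
    rng = random.Random(seed)
    observed = abs(mean(xs) - mean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(xs)]) - mean(pooled[len(xs):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Toy data: hypothetical sentence lengths for human and model text.
    human_lengths = [12, 15, 9, 20, 14, 11, 17, 13]
    model_lengths = [11, 16, 10, 19, 13, 12, 18, 14]
    print(type_token_curve("the cat saw the dog".split()))
    print(permutation_test(human_lengths, model_lengths))
```

A large p-value here would mean the test finds no evidence that the two samples differ on the chosen statistic; it does not by itself show that the model matches natural language on other statistics.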
