Language Model Evaluation Beyond Perplexity

Abstract

We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than with proposed theoretical distributions (when present). Further, the fit to different distributions is highly dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type-token relationship of natural language than text produced using standard ancestral sampling; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.
