Neural network language models can serve as computational hypotheses about how humans process language. We compared the model-human consistency of diverse language models using a novel experimental approach: controversial sentence pairs. For each controversial sentence pair, two language models disagree about which sentence is more likely to occur in natural text. Considering nine language models (including n-gram, recurrent neural networks, and transformer models), we created hundreds of such controversial sentence pairs by either selecting sentences from a corpus or synthetically optimizing sentence pairs to be highly controversial. Human subjects then provided judgments indicating for each pair which of the two sentences is more likely. Controversial sentence pairs proved highly effective at revealing model failures and identifying models that aligned most closely with human judgments. The most human-consistent model tested was GPT-2, although experiments also revealed significant shortcomings of its alignment with human perception.
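
The corpus-selection idea can be illustrated with a minimal sketch. It is not the procedure used in the study: the log-probability tables `toy_a` and `toy_b` are made-up stand-ins for two language models, and the function name `controversial_pairs` and its "strength" measure (the smaller of the two models' preference margins) are assumptions chosen for illustration only.

```python
from itertools import combinations

# Hypothetical sentence log-probabilities under two toy "models".
# The numbers are invented for illustration, not taken from any real model.
toy_a = {"the cat sat": -5.0, "cat the sat": -9.0, "a dog ran home": -8.0}
toy_b = {"the cat sat": -8.0, "cat the sat": -6.0, "a dog ran home": -10.0}


def controversial_pairs(sentences, score_a, score_b):
    """Return (preferred_by_a, preferred_by_b, strength) for every pair on
    which the two scorers disagree, most controversial first."""
    log_a = {s: score_a(s) for s in sentences}
    log_b = {s: score_b(s) for s in sentences}
    results = []
    for s1, s2 in combinations(sentences, 2):
        diff_a = log_a[s1] - log_a[s2]  # > 0 means model A prefers s1
        diff_b = log_b[s1] - log_b[s2]  # > 0 means model B prefers s1
        if diff_a * diff_b < 0:  # opposite signs: the models disagree
            if diff_a < 0:
                # Reorder so the first sentence is the one model A prefers.
                s1, s2 = s2, s1
                diff_a, diff_b = -diff_a, -diff_b
            # Assumed controversiality measure: the weaker of the two margins.
            strength = min(diff_a, -diff_b)
            results.append((s1, s2, strength))
    results.sort(key=lambda item: item[2], reverse=True)
    return results


pairs = controversial_pairs(list(toy_a), toy_a.get, toy_b.get)
for preferred_a, preferred_b, strength in pairs:
    print(f"A prefers {preferred_a!r}, B prefers {preferred_b!r} (strength {strength})")
```

In this toy setting, a human judgment on each returned pair would side with exactly one of the two models, which is what makes such pairs informative for comparing models.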