What Kind of Language Is Hard to Language-Model?

Abstract

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
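
To make the idea of a paired-sample multiplicative model more concrete, the following is a minimal sketch and not the paper's actual model. It assumes that the cost y[i, j] of sentence i in language j factors as n[i] * exp(d[j]), where n[i] is a per-sentence effect and d[j] is a per-language difficulty coefficient, and that missing translations are marked as NaN. The functional form, the alternating least-squares fit in log space, and the name `fit_multiplicative_effects` are all assumptions made for illustration.

```python
import numpy as np


def fit_multiplicative_effects(y, n_iter=200):
    """Fit y[i, j] ~ n[i] * exp(d[j]) using only the observed cells.

    y is a (sentences x languages) array of positive costs, with NaN
    wherever a sentence has no translation in that language. Every row
    and every column is assumed to contain at least one observed cell.
    """
    observed = ~np.isnan(y)
    # Work in log space, where the multiplicative model becomes additive:
    # log y[i, j] = a[i] + d[j]. Missing cells get a dummy value that is
    # masked out in every update below.
    log_y = np.log(np.where(observed, y, 1.0))

    a = np.zeros(y.shape[0])  # log of the per-sentence effect n[i]
    d = np.zeros(y.shape[1])  # per-language difficulty d[j]

    for _ in range(n_iter):
        # Least-squares update of a[i] given d: mean residual over the
        # languages in which sentence i is observed.
        resid = np.where(observed, log_y - d[None, :], 0.0)
        a = resid.sum(axis=1) / observed.sum(axis=1)

        # Least-squares update of d[j] given a: mean residual over the
        # sentences observed in language j.
        resid = np.where(observed, log_y - a[:, None], 0.0)
        d = resid.sum(axis=0) / observed.sum(axis=0)

        # The model is only identified up to a constant shift between
        # a and d, so pin the difficulties to have mean zero.
        shift = d.mean()
        d = d - shift
        a = a + shift

    return np.exp(a), d
```

Under these assumptions, a language with d[j] > 0 is harder than average and one with d[j] < 0 is easier, and the comparison is made only through sentences that are actually shared, which is how a model of this kind can account for inter-sentence variation and tolerate missing data.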
