Abstract
Multilingual BERT (mBERT), trained on 104 languages, has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-Speech Tagging, and Dependency Parsing (54 languages each). mBERT performs better than or comparably to baselines on high-resource languages but much worse on low-resource languages. Furthermore, monolingual BERT models for these languages do even worse. When these languages are paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low-resource languages require more efficient pretraining techniques or more data.