Abstract
Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages, with little linguistic diversity. We argue that this makes existing practices in multilingual evaluation unreliable and fails to provide a full picture of the performance of MMLMs across the linguistic landscape. We propose that recent work on Performance Prediction for NLP tasks can serve as a potential solution for fixing benchmarking in Multilingual NLP, by using features related to data and language typology to estimate the performance of an MMLM on different languages. We compare performance prediction with translating test data in a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of performance that are often on par with the translation-based approaches, without incurring any additional translation or evaluation costs.