Abstract
The development of Large Language Models (LLMs) relies on extensive text corpora, which are often unevenly distributed across languages. This imbalance results in LLMs performing significantly better on high-resource languages like English, German, and French, while their capabilities in low-resource languages remain inadequate. Currently, there is a lack of quantitative methods to evaluate the performance of LLMs in these low-resource languages. To address this gap, we propose the Language Ranker, an intrinsic metric designed to benchmark and rank languages based on LLM performance using internal representations. By comparing the LLM's internal representation of various languages against a baseline derived from English, we can assess the model's multilingual capabilities in a robust and language-agnostic manner. Our analysis reveals that high-resource languages exhibit higher similarity scores with English, which corresponds to stronger performance, while low-resource languages show lower similarity scores, supporting the effectiveness of our metric in assessing language-specific capabilities. In addition, the experiments show a strong correlation between the LLM's performance in different languages and the proportion of those languages in its pre-training corpus. These insights underscore the efficacy of the Language Ranker as a tool for evaluating LLM performance across different languages, particularly those with limited resources.
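
The idea of scoring each language by its similarity to an English baseline can be illustrated with a minimal sketch. The abstract does not state which similarity measure, model layer, or pooling strategy is used, so the following are assumptions made only for illustration: cosine similarity as the measure, one fixed-length vector per sentence, and parallel sentences aligned by index across languages. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def rank_languages(english_reps, reps_by_language):
    """Rank languages by mean similarity to the English baseline.

    english_reps: list of vectors, one per sentence in English.
    reps_by_language: dict mapping a language code to a list of vectors
        for the same sentences, aligned by index with english_reps.
    Returns (language, score) pairs sorted from highest to lowest score.
    """
    scores = {}
    for lang, reps in reps_by_language.items():
        sims = [cosine_similarity(e, r) for e, r in zip(english_reps, reps)]
        scores[lang] = mean(sims)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors standing in for sentence representations (not real data).
english = [[1.0, 0.0], [0.0, 1.0]]
others = {
    "de": [[0.9, 0.1], [0.1, 0.9]],  # close to the English vectors
    "sw": [[0.5, 0.5], [0.5, 0.5]],  # further from the English vectors
}
ranking = rank_languages(english, others)
```

In this toy setup the language whose vectors lie closer to the English ones is ranked first. In practice the vectors would come from the hidden states of the LLM being evaluated.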