Abstract
Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose $\textit{Language Imbalance Driven Rewarding}$, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.