Abstract
Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3), derived from the Python Programming Puzzles (Schuster et al. 2021) benchmark, that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate whether balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease on high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.
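
The pass@k figures above depend on how the metric is estimated from a finite number of sampled programs. The abstract does not say which estimator is used, so the following is only a minimal sketch that assumes the widely used unbiased estimator of Chen et al. (2021): with n samples per problem, of which c pass every test case, pass@k is estimated as 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not part of BabelCode.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of sampled programs, c: number that pass all tests,
    k: sample budget being evaluated (assumed 1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when n - c < k, which gives an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)
```

For instance, `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages the per-problem estimates for two hypothetical problems with 3 and 0 passing samples out of 10. Averaging over the problems of a single language, rather than over all problems pooled together, is one way to obtain the kind of per-language comparison summarized above.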