Measuring The Impact Of Programming Language Distribution

  • 2023-03-15 15:36:49
  • Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta

Abstract

Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al. 2021) benchmark that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate whether balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse $pass@k$.
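
The abstract reports every result as $pass@k$ but does not say how the metric is computed. As a minimal sketch, the snippet below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n programs are sampled per problem and c of them pass all test cases. The function name `pass_at_k` and its parameters are illustrative and are not taken from BabelCode.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed estimator, not BabelCode's API).

    n: number of programs sampled for the problem
    c: number of those samples that pass every test case
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no passing sample;
    # it is 0 when fewer than k failing samples exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark score is the mean of the per-problem estimates, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)` for two problems with 10 samples each.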

 
