CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Abstract

Code benchmarks such as HumanEval are widely adopted to evaluate the coding capabilities of Large Language Models (LLMs). However, existing code benchmarks have a substantial programming language bias: over 95% of code generation benchmarks are dominated by Python, leaving LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Coding task bias is also a concern. Most benchmarks focus on code generation, while benchmarks for code reasoning (predicting the output for a given input, and predicting an input for a given output), an essential coding capability, are insufficient. Yet constructing multi-lingual benchmarks can be expensive and labor-intensive, and code from contest websites such as Leetcode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that covers 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. In particular, the construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, iteratively generating and repairing code based on execution feedback. To cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulate various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket has less correlation with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.
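
The two code reasoning tasks named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the subject function `f`, the helper names, and the scoring function are hypothetical and are not taken from the benchmark or its harness. The sketch also assumes a single prediction per problem, in which case Pass@1 reduces to the fraction of problems answered correctly.

```python
from typing import Callable, Sequence


def f(text: str) -> str:
    """Hypothetical benchmark subject: uppercase characters at even indices."""
    return "".join(ch.upper() if i % 2 == 0 else ch for i, ch in enumerate(text))


def check_output_prediction(func: Callable, given_input, predicted_output) -> bool:
    # Output prediction: the model sees the code and an input,
    # and must state what the function returns.
    return func(given_input) == predicted_output


def check_input_prediction(func: Callable, predicted_input, given_output) -> bool:
    # Input prediction: the model sees the code and an output, and must
    # propose any input that produces it. Several inputs may be valid,
    # so the check executes the function instead of comparing inputs.
    return func(predicted_input) == given_output


def pass_at_1(results: Sequence[bool]) -> float:
    # Assumes exactly one sampled prediction per problem.
    return sum(results) / len(results) if results else 0.0


results = [
    check_output_prediction(f, "abcd", "AbCd"),  # correct output
    check_input_prediction(f, "abCd", "AbCd"),   # a valid, non-unique input
    check_output_prediction(f, "abcd", "ABCD"),  # incorrect output
]
print(pass_at_1(results))
```

Checking input predictions by execution, and not by string match against a reference input, is the design choice that lets any valid input count as correct.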
