Abstract
Code benchmarks such as HumanEval are widely adopted to evaluate the coding capabilities of Large Language Models (LLMs). However, existing code benchmarks have a non-negligible programming language bias: over 95% of code generation benchmarks are dominated by Python, leaving LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Coding task bias is also a concern. Most benchmarks focus on code generation, while benchmarks for code reasoning (predicting the output given an input, and predicting an input given the output), an essential coding capability, are insufficient. Yet constructing multi-lingual benchmarks can be expensive and labor-intensive, and code from contest websites such as Leetcode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that covers 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. In particular, the construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, iteratively generating and repairing code based on execution feedback. In addition, to cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulate various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket correlates less with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.
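The two code reasoning tasks named above can be made concrete with a minimal sketch. It assumes each subject is a single function paired with one input/output example; the function `f` and the helper names below are hypothetical, invented for illustration, and are not taken from the benchmark or its evaluation harness.

```python
from typing import Any, Callable


# Hypothetical subject function (an assumption for illustration only).
def f(text: str) -> str:
    return text[::-1].upper()


def check_output_prediction(func: Callable[[Any], Any],
                            given_input: Any,
                            predicted_output: Any) -> bool:
    """Output reasoning: the model sees the input and predicts the output."""
    return func(given_input) == predicted_output


def check_input_prediction(func: Callable[[Any], Any],
                           predicted_input: Any,
                           given_output: Any) -> bool:
    """Input reasoning: the model sees the output and proposes an input.

    Any input that reproduces the given output is accepted, so the check
    executes the function instead of comparing against one reference input.
    """
    return func(predicted_input) == given_output
```

In this sketch both checks run the function, so an input prediction that differs from the original example input still counts as correct as long as it yields the same output.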