AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. In addition, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
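
The idea of obtaining test outputs by execution rather than by annotation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_in_sandbox` and `build_test_cases` are hypothetical, the candidate inputs are hard-coded where the described method would have an LLM propose them, and a plain subprocess stands in for the multilingual sandbox (it provides no real isolation).

```python
import json
import subprocess
import sys


def run_in_sandbox(solution_code, func_name, args, timeout=5.0):
    """Run func_name(*args) from solution_code in a separate Python process.

    Hypothetical stand-in for a sandbox. Returns (ok, value); ok is False
    if the process crashes, times out, or prints something that is not JSON.
    """
    # Driver script: the solution followed by a call that prints the result as JSON.
    driver = (
        solution_code
        + "\nimport json, sys\n"
        + f"print(json.dumps({func_name}(*json.loads(sys.argv[1]))))\n"
    )
    try:
        result = subprocess.run(
            [sys.executable, "-c", driver, json.dumps(args)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, None
    if result.returncode != 0:
        return False, None
    try:
        return True, json.loads(result.stdout)
    except ValueError:
        return False, None


def build_test_cases(solution_code, func_name, candidate_inputs):
    """Pair each input with the output actually produced by execution.

    Inputs on which the solution fails are dropped (a simple filtering step).
    """
    cases = []
    for args in candidate_inputs:
        ok, output = run_in_sandbox(solution_code, func_name, args)
        if ok:
            cases.append({"input": args, "expected": output})
    return cases


# Toy reference solution and candidate inputs (assumed, for illustration only).
solution = "def add(a, b):\n    return a + b\n"
candidate_inputs = [[1, 2], [10, -3], ["x", 1]]  # the last input raises TypeError

cases = build_test_cases(solution, "add", candidate_inputs)
print(cases)
```

Because the expected outputs come from running the solution, they are consistent with it by construction, and inputs that crash the solution are filtered out instead of producing guessed labels.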
