BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Abstract

We introduce BigO(Bench), a novel coding benchmark designed to evaluate thecapabilities of generative language models in understanding and generating codewith specified time and space complexities. This benchmark addresses the gap incurrent evaluations that often overlook the ability of models to comprehend andproduce code constrained by computational complexity. BigO(Bench) includestooling to infer the algorithmic complexity of any Python function fromprofiling measurements, including human- or LLM-generated solutions.BigO(Bench) also includes of set of 3,105 coding problems and 1,190,250solutions from Code Contests annotated with inferred (synthetic) time and spacecomplexity labels from the complexity framework, as well as correspondingruntime and memory footprint values for a large set of input sizes. We presentresults from evaluating multiple state-of-the-art language models on thisbenchmark, highlighting their strengths and weaknesses in handling complexityrequirements. In particular, token-space reasoning models are unrivaled in codegeneration but not in complexity understanding, hinting that they may notgeneralize well to tasks for which no reward was given at training time.

Quick Read (beta)

loading the full paper ...