Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Abstract

Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the Planning Domain Definition Language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models, and the results reveal this task's complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable and 82.2% are valid, solvable problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
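
To make the gap between "solvable" and "semantically correct" concrete, here is a minimal Python sketch. It is not the equivalence algorithm from the paper, which the abstract does not describe. It is a toy brute-force check that assumes a hypothetical representation in which a problem is a dictionary of object names plus sets of ground facts for the initial state and the goal. Two problems count as equivalent here if some renaming of objects maps one exactly onto the other.

```python
from itertools import permutations


def rename(facts, mapping):
    """Apply an object renaming to a set of (predicate, args) ground facts."""
    return {(pred, tuple(mapping[a] for a in args)) for pred, args in facts}


def equivalent(problem_a, problem_b):
    """Toy check (an assumption, not the paper's method): is there a bijection
    between object names that maps problem_a's init and goal onto problem_b's?"""
    objs_a = sorted(problem_a["objects"])
    objs_b = sorted(problem_b["objects"])
    if len(objs_a) != len(objs_b):
        return False
    # Brute force over all renamings; only feasible for tiny problems.
    for perm in permutations(objs_b):
        mapping = dict(zip(objs_a, perm))
        if (rename(problem_a["init"], mapping) == problem_b["init"]
                and rename(problem_a["goal"], mapping) == problem_b["goal"]):
            return True
    return False


# Hypothetical two-block example: stack one block on the other.
reference = {
    "objects": {"a", "b"},
    "init": {("on-table", ("a",)), ("on-table", ("b",)),
             ("clear", ("a",)), ("clear", ("b",))},
    "goal": {("on", ("a", "b"))},
}

# Same task, different object names: should be accepted.
renamed = {
    "objects": {"x", "y"},
    "init": {("on-table", ("x",)), ("on-table", ("y",)),
             ("clear", ("x",)), ("clear", ("y",))},
    "goal": {("on", ("y", "x"))},
}

# Well-formed and trivially solvable, but it describes a different task.
wrong_goal = {
    "objects": {"a", "b"},
    "init": reference["init"],
    "goal": {("clear", ("a",))},
}

print(equivalent(reference, renamed))
print(equivalent(reference, wrong_goal))
```

A planner-based validator would accept both `renamed` and `wrong_goal`, since each is a well-formed problem with a plan. Only a comparison against the ground truth separates them, and that comparison has to tolerate superficial differences such as object names.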
