HoarePrompt: Structural Reasoning About Program Correctness in Natural Language

Abstract

While software requirements are often expressed in natural language,verifying the correctness of a program against natural language requirements isa hard and underexplored problem. Large language models (LLMs) are promisingcandidates for addressing this challenge, however our experience shows thatthey are ineffective in this task, often failing to detect even straightforwardbugs. To address this gap, we introduce HoarePrompt, a novel approach thatadapts fundamental ideas from program analysis and verification to naturallanguage artifacts. Drawing inspiration from the strongest postconditioncalculus, HoarePrompt employs a systematic, step-by-step process in which anLLM generates natural language descriptions of reachable program states atvarious points in the code. To manage loops, we propose few-shot-drivenk-induction, an adaptation of the k-induction method widely used in modelchecking. Once program states are described, HoarePrompt leverages the LLM toassess whether the program, annotated with these state descriptions, conformsto the natural language requirements. For evaluating the quality of classifiersof program correctness with respect to natural language requirements, weconstructed CoCoClaNeL, a challenging dataset of solutions to programmingcompetition problems. Our experiments show that HoarePrompt improves the MCC by62% compared to directly using Zero-shot-CoT prompts for correctnessclassification. Furthermore, HoarePrompt outperforms a classifier that assessescorrectness via LLM-based test generation by increasing the MCC by 93%. Theinductive reasoning mechanism contributes a 28% boost to MCC, underscoring itseffectiveness in managing loops.

Quick Read (beta)

loading the full paper ...