Do AI models help produce verified bug fixes?

Abstract

Among areas of software engineering where AI techniques -- particularly,Large Language Models -- seem poised to yield dramatic improvements, anattractive candidate is Automatic Program Repair (APR), the production ofsatisfactory corrections to software bugs. Does this expectation materialize inpractice? How do we find out, making sure that proposed corrections actuallywork? If programmers have access to LLMs, how do they actually use them tocomplement their own skills? To answer these questions, we took advantage of the availability of aprogram-proving environment, which formally determines the correctness ofproposed fixes, to conduct a study of program debugging with two randomlyassigned groups of programmers, one with access to LLMs and the other without,both validating their answers through the proof tools. The methodology reliedon a division into general research questions (Goals in the Goal-Query-Metricapproach), specific elements admitting specific answers (Queries), andmeasurements supporting these answers (Metrics). While applied so far to alimited sample size, the results are a first step towards delineating a properrole for AI and LLMs in providing guaranteed-correct fixes to program bugs. These results caused surprise as compared to what one might expect from theuse of AI for debugging and APR. The contributions also include: a detailedmethodology for experiments in the use of LLMs for debugging, which otherprojects can reuse; a fine-grain analysis of programmer behavior, made possibleby the use of full-session recording; a definition of patterns of use of LLMs,with 7 distinct categories; and validated advice for getting the best of LLMsfor debugging and Automatic Program Repair.

Quick Read (beta)

loading the full paper ...