Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation

Abstract

Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and showed that LLMs frequently do generate undesirable test smells, up to 37% of the time. Then, we implemented a lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on all code quality metrics, in spite of training a substantially cheaper Codex model. We provide insights into how to reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166.
