Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation

Abstract

Software testing is a crucial aspect of software development, and the creation of high-quality tests that adhere to best practices is essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM). To begin, we analyze the anti-patterns generated by the LLM and show that LLMs can generate undesirable test smells. We then train a specific reward model for each static quality metric and use Proximal Policy Optimization (PPO) to train models that optimize a single quality metric at a time. Furthermore, we combine these rewards into a unified reward model aimed at capturing different best practices and quality aspects of tests. By comparing RL-trained models with those trained using supervised learning, we provide insights into how to reliably use RL to improve test generation quality and into the effects of various training strategies. Our experimental results demonstrate that the RL-optimized model consistently generated higher-quality test cases than the base LLM, improving the model by up to 21%, and generated nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on four out of seven metrics. This represents a significant step towards enhancing the overall efficiency and reliability of software testing through Reinforcement Learning and static quality metrics. Our data are available at https://figshare.com/s/ded476c8d4c221222849.
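
To make the idea of a unified reward concrete, the following is a minimal sketch of how several static quality checks might be folded into one scalar signal. It is not the authors' implementation: the abstract describes trained reward models, whereas this sketch swaps in simple regex heuristics for illustration. The three checks, their names, the `METRICS` table, and the weighted-average rule in `unified_reward` are all assumptions introduced here.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical static checks. Each returns 1.0 when the test looks clean
# for that metric and 0.0 when the corresponding smell is detected.
# These are toy regex heuristics, not the paper's reward models.

ASSERT_CALL = r"(Assert\.\w+|assert\w*)\s*\("


def has_assertion(test_code: str) -> float:
    """1.0 if at least one assertion call appears anywhere in the test."""
    return 1.0 if re.search(r"\b" + ASSERT_CALL, test_code) else 0.0


def no_duplicate_assertions(test_code: str) -> float:
    """1.0 if no assertion line is repeated verbatim."""
    lines = [ln.strip() for ln in test_code.splitlines()]
    asserts = [ln for ln in lines if re.match(ASSERT_CALL, ln)]
    return 1.0 if len(asserts) == len(set(asserts)) else 0.0


def no_conditional_logic(test_code: str) -> float:
    """1.0 if the test body contains no branching or looping keywords."""
    return 0.0 if re.search(r"\b(if|for|while|switch)\b", test_code) else 1.0


METRICS: Dict[str, Callable[[str], float]] = {
    "has_assertion": has_assertion,
    "no_duplicate_assertions": no_duplicate_assertions,
    "no_conditional_logic": no_conditional_logic,
}


def unified_reward(test_code: str,
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of the per-metric scores, in the range [0, 1]."""
    weights = weights or {name: 1.0 for name in METRICS}
    total = sum(weights.values())
    score = sum(weights[name] * fn(test_code) for name, fn in METRICS.items())
    return score / total
```

In a PPO setup, a scalar like the one returned by `unified_reward` would play the role of the reward for each generated test. Optimizing one metric at a time corresponds to putting all the weight on a single entry of `weights`.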
