An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Abstract

Unit tests play a key role in ensuring the correctness of software. However,manually creating unit tests is a laborious task, motivating the need forautomation. Large Language Models (LLMs) have recently been applied to thisproblem, utilizing additional training or few-shot learning on examples ofexisting tests. This paper presents a large-scale empirical evaluation on theeffectiveness of LLMs for automated unit test generation without additionaltraining or manual effort, providing the LLM with the signature andimplementation of the function under test, along with usage examples extractedfrom documentation. We also attempt to repair failed generated tests byre-prompting the model with the failing test and error message. We implementour approach in TestPilot, a test generation tool for JavaScript thatautomatically generates unit tests for all API functions in an npm package. Weevaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with atotal of 1,684 API functions. The generated tests achieve a median statementcoverage of 70.2% and branch coverage of 52.8%, significantly improving onNessie, a recent feedback-directed JavaScript test generation technique, whichachieves only 51.3% statement coverage and 25.6% branch coverage. We also findthat 92.8% of TestPilot's generated tests have no more than 50% similarity withexisting tests (as measured by normalized edit distance), with none of thembeing exact copies. Finally, we run TestPilot with two additional LLMs,OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, weobserved similar results with the former (68.2% median statement coverage), andsomewhat worse results with the latter (54.0% median statement coverage),suggesting that the effectiveness of the approach is influenced by the size andtraining set of the LLM, but does not fundamentally depend on the specificmodel.

Quick Read (beta)

loading the full paper ...