Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation
| Published in: | arXiv.org (Dec 18, 2024), p. n/a |
|---|---|
| Main author: | Steenhoek, Benjamin |
| Other authors: | Tufano, Michele; Sundaresan, Neel; Svyatkovskiy, Alexey |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Software reliability; Best practice; Source code; Static code analysis; Large language models; Automation; Software development; Software testing |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Tag | Ind1 | Ind2 | Value |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3147568761 |
| 003 | | | UK-CbPIL |
| 022 | | | |a 2331-8422 |
| 035 | | | |a 3147568761 |
| 045 | 0 | | |b d20241218 |
| 100 | 1 | | |a Steenhoek, Benjamin |
| 245 | 1 | | |a Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation |
| 260 | | | |b Cornell University Library, arXiv.org |c Dec 18, 2024 |
| 513 | | | |a Working Paper |
| 520 | 3 | | |a Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and showed that LLMs frequently generate undesirable test smells -- up to 37% of the time. Then, we implemented a lightweight static-analysis-based reward model and trained LLMs with it to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23% and producing nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on all code quality metrics, despite training a substantially cheaper Codex model. We provide insights into how to reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166. |
| 653 | | | |a Software reliability |
| 653 | | | |a Best practice |
| 653 | | | |a Source code |
| 653 | | | |a Static code analysis |
| 653 | | | |a Large language models |
| 653 | | | |a Automation |
| 653 | | | |a Software development |
| 653 | | | |a Software testing |
| 700 | 1 | | |a Tufano, Michele |
| 700 | 1 | | |a Sundaresan, Neel |
| 700 | 1 | | |a Svyatkovskiy, Alexey |
| 773 | 0 | | |t arXiv.org |g (Dec 18, 2024), p. n/a |
| 786 | 0 | | |d ProQuest |t Engineering Database |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3147568761/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | |3 Full text outside of ProQuest |u http://arxiv.org/abs/2412.14308 |
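
The abstract (field 520 above) describes a reward model built from lightweight static analysis of generated tests. As a rough illustration of that idea, here is a minimal Python sketch of such a reward function: it checks a few properties (syntactic validity, presence of assertions, absence of two common test smells) and averages them into a scalar that an RL trainer could maximize. The specific checks, the `reward` helper, and the use of Python/pytest-style tests are assumptions for illustration; the paper's actual five quality metrics and reward shaping are defined in the full text at http://arxiv.org/abs/2412.14308.

```python
import ast
import re

# Illustrative static-analysis reward for generated unit tests, in the
# spirit of RLSQM. The concrete checks below are stand-ins, not the
# paper's actual five metrics.

def is_syntactically_valid(test_src: str) -> bool:
    """Parse the test without executing it; unparseable output scores worst."""
    try:
        ast.parse(test_src)
        return True
    except SyntaxError:
        return False

def has_assertion(test_src: str) -> bool:
    """A test without any assertion is a classic smell."""
    return bool(re.search(r"^\s*assert\b", test_src, re.MULTILINE))

def has_conditional_logic(test_src: str) -> bool:
    """Branching or looping inside a test body is the 'conditional test logic' smell."""
    return bool(re.search(r"^\s+(if|for|while)\b", test_src, re.MULTILINE))

def has_duplicate_asserts(test_src: str) -> bool:
    """Repeated identical assert statements add nothing ('duplicate assert' smell)."""
    asserts = [a.strip() for a in re.findall(r"^\s*assert\b.*$", test_src, re.MULTILINE)]
    return len(asserts) != len(set(asserts))

def reward(test_src: str) -> float:
    """Scalar in [-1, 1]: +1 per satisfied property, -1 per violation, averaged.
    An RL loop (e.g., PPO) would maximize this over sampled test generations."""
    if not is_syntactically_valid(test_src):
        return -1.0  # minimum reward for code that does not parse
    scores = [
        1.0 if has_assertion(test_src) else -1.0,
        -1.0 if has_conditional_logic(test_src) else 1.0,
        -1.0 if has_duplicate_asserts(test_src) else 1.0,
    ]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    clean = "def test_add():\n    assert add(2, 3) == 5\n"
    smelly = "def test_add():\n    if True:\n        x = add(2, 3)\n"
    print(reward(clean))   # 1.0: has an assertion, no smells
    print(reward(smelly))  # negative: conditional logic, no assertion
```

Because the reward is purely static (no test execution), it is cheap enough to query on every sampled generation during training, which matches the abstract's emphasis on a lightweight reward model.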