Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 18, 2024), p. n/a
Main author: Steenhoek, Benjamin
Other authors: Tufano, Michele; Sundaresan, Neel; Svyatkovskiy, Alexey
Published:
Cornell University Library, arXiv.org
Subjects: Software reliability; Best practice; Source code; Static code analysis; Large language models; Automation; Software development; Software testing
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3147568761
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3147568761 
045 0 |b d20241218 
100 1 |a Steenhoek, Benjamin 
245 1 |a Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation 
260 |b Cornell University Library, arXiv.org  |c Dec 18, 2024 
513 |a Working Paper 
520 3 |a Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and show that LLMs frequently do generate undesirable test smells -- up to 37% of the time. Then, we implemented a lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically-correct code. RLSQM also outperformed GPT-4 on all code quality metrics, in spite of training a substantially cheaper Codex model. We provide insights into how to reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166.
653 |a Software reliability 
653 |a Best practice 
653 |a Source code 
653 |a Static code analysis 
653 |a Large language models 
653 |a Automation 
653 |a Software development 
653 |a Software testing 
700 1 |a Tufano, Michele 
700 1 |a Sundaresan, Neel 
700 1 |a Svyatkovskiy, Alexey 
773 0 |t arXiv.org  |g (Dec 18, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3147568761/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.14308
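
To give a concrete sense of the "lightweight static analysis-based reward model" described in the abstract (field 520) above, the following Python sketch shows one possible hand-crafted stand-in: a reward in [0, 1] computed from a few simple test-smell checks on a generated test. All function names, checks, and weights here are illustrative assumptions and do not reflect the authors' actual reward model, metrics, or training setup.

# Illustrative sketch only (not the paper's implementation): a static-check
# based reward for a generated Java-style unit test. Checks and weights are
# hypothetical assumptions for illustration.
import re

def static_quality_reward(test_code: str) -> float:
    """Score a generated test in [0, 1] using lightweight static checks."""
    checks = {
        # Reward: the test contains at least one assertion call.
        "has_assertion": bool(re.search(r"\bassert\w*\s*\(", test_code)),
        # Penalize the "empty test" smell: a body with no statements at all.
        "non_empty_body": bool(re.search(r"\{[^{}]*;[^{}]*\}", test_code, re.S)),
        # Penalize exception-swallowing catch blocks (a common test smell).
        "no_swallowed_exception": not re.search(r"catch\s*\([^)]*\)\s*\{\s*\}", test_code),
        # Reward descriptive method names rather than placeholders like test1().
        "descriptive_name": not re.search(r"\bvoid\s+test\d*\s*\(", test_code),
        # Reward the presence of an explanatory comment.
        "has_comment": "//" in test_code or "/*" in test_code,
    }
    # Equal weighting over five checks; a learned or tuned reward model would
    # replace this hand-crafted average.
    return sum(checks.values()) / len(checks)

if __name__ == "__main__":
    sample = """
    @Test
    public void addsTwoNumbers() {
        // addition should return the sum of its operands
        assertEquals(4, Calculator.add(2, 2));
    }
    """
    print(f"reward = {static_quality_reward(sample):.2f}")

In an RL fine-tuning loop of the kind the abstract describes, a scalar score like this would be computed for each sampled test and used as the reward signal when updating the generator; the specific metrics, their number (five in the paper), and how they are combined are details the record above does not specify.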