Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis

Bibliographic Details
Published in: arXiv.org (Dec 19, 2024), p. n/a
Main Author: Dolcetti, Greta
Other Authors: Arceri, Vincenzo, Iotti, Eleonora, Maffeis, Sergio, Cortesi, Agostino, Zaffanella, Enea
Published:
Cornell University Library, arXiv.org
Subjects: Source code, Static code analysis, Large language models, Artificial intelligence, Feedback, Software development
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3147568002
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3147568002 
045 0 |b d20241219 
100 1 |a Dolcetti, Greta 
245 1 |a Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis 
260 |b Cornell University Library, arXiv.org  |c Dec 19, 2024 
513 |a Working Paper 
520 3 |a Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then we employ ground-truth tests to assess the (in)correctness of the generated code, and a static analysis tool to detect potential safety vulnerabilities. Next, we assess the models' ability to evaluate the generated code, by asking them to detect errors and vulnerabilities. Finally, we test the models' ability to fix the generated code, providing the reports produced during the static analysis and incorrectness evaluation phases as feedback. Our results show that models often produce incorrect code, and that the generated code can include safety issues. Moreover, they perform very poorly at detecting either issue. On the positive side, we observe a substantial ability to fix flawed code when provided with information about failed tests or potential vulnerabilities, indicating a promising avenue for improving the safety of LLM-based code generation tools. 
653 |a Source code 
653 |a Static code analysis 
653 |a Large language models 
653 |a Artificial intelligence 
653 |a Feedback 
653 |a Software development 
700 1 |a Arceri, Vincenzo 
700 1 |a Iotti, Eleonora 
700 1 |a Maffeis, Sergio 
700 1 |a Cortesi, Agostino 
700 1 |a Zaffanella, Enea 
773 0 |t arXiv.org  |g (Dec 19, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3147568002/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.14841