Code Generation With Large Language Models: Inductive Reasoning and Calibration

Bibliographic details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Li, Wen-Ding
Published:
ProQuest Dissertations & Theses
Subjects: Computer science; Artificial intelligence; Computer engineering
Links: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3248434448
003 UK-CbPIL
020 |a 9798293822805 
035 |a 3248434448 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Li, Wen-Ding 
245 1 |a Code Generation With Large Language Models: Inductive Reasoning and Calibration 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including code generation. However, complex inductive reasoning (deriving general rules from limited observations) remains a significant challenge. Programming-by-Examples (PBE) aims to synthesize programs from input-output examples and is an important inductive reasoning task in programming languages with practical applications. We propose an approach that enhances LLMs on PBE using code-grounded synthetic data generation, which provides high-quality training data for fine-tuning LLMs and addresses the scarcity of domain-specific data. Furthermore, we demonstrate that scaling test-time computation significantly improves inference results in this PBE setting. Our approach achieves state-of-the-art results on common PBE benchmarks, including the string, number sequence, and logo graphics domains. We further extend our methods to ARC-AGI, a very challenging benchmark requiring visual inductive reasoning from a few examples involving concepts such as physics, objects, and symmetry. By applying our synthetic data and test-time scaling methods and then combining them with transduction, we approach human-level performance on ARC-AGI, demonstrating the framework's effectiveness even in highly challenging, visually grounded domains. Unlike PBE and ARC-AGI tasks, where examples enable direct validation, real-world code generation often begins with ambiguous natural language specifications. This inherent ambiguity creates uncertainty about code correctness. We develop an approach that samples both code and tests from LLMs and uses the execution results to build a classifier that estimates correctness probabilities. The method produces human-interpretable predicates explaining code behavior, a feature that users preferred in a user study, and helps create more trustworthy program synthesis while maintaining state-of-the-art accuracy. 
653 |a Computer science 
653 |a Artificial intelligence 
653 |a Computer engineering 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3248434448/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3248434448/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch
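
A minimal Python sketch of the test-time scaling idea summarized in the abstract above: draw many candidate programs from an LLM and keep only those consistent with every input-output example. The sample_programs call and all other identifiers here are hypothetical placeholders, not the dissertation's actual API.

from typing import List, Tuple

Example = Tuple[object, object]  # (input, expected output)

def run_candidate(src: str, x: object) -> object:
    """Execute a candidate program that defines f(x) and apply it to x."""
    scope: dict = {}
    exec(src, scope)  # the candidate source is expected to define f
    return scope["f"](x)

def filter_by_examples(candidates: List[str], examples: List[Example]) -> List[str]:
    """Keep only candidates whose output matches every input-output example."""
    survivors = []
    for src in candidates:
        try:
            if all(run_candidate(src, x) == y for x, y in examples):
                survivors.append(src)
        except Exception:
            continue  # crashing or non-parsing candidates are discarded
    return survivors

# Test-time scaling: sampling more candidates raises the chance that at least
# one consistent program survives filtering, e.g. (hypothetical LLM call):
# candidates = sample_programs(prompt, n=64)
# consistent = filter_by_examples(candidates, examples)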
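
Similarly, a minimal sketch of the calibration idea from the abstract: sample both candidate programs and test predicates, execute every pair, and derive features (here, just the pass rate) on which a correctness classifier can be trained. The check(f) test convention and the feature choice are illustrative assumptions, not the dissertation's exact design.

from typing import List

def passes(code_src: str, test_src: str) -> bool:
    """True iff the sampled test predicate accepts the candidate program."""
    scope: dict = {}
    try:
        exec(code_src, scope)  # expected to define f
        exec(test_src, scope)  # expected to define check(f) -> bool
        return bool(scope["check"](scope["f"]))
    except Exception:
        return False

def execution_matrix(codes: List[str], tests: List[str]) -> List[List[bool]]:
    """Pass/fail outcome for every (candidate, test) pair."""
    return [[passes(c, t) for t in tests] for c in codes]

def pass_rate_features(matrix: List[List[bool]]) -> List[float]:
    """One simple per-candidate feature: the fraction of sampled tests passed.
    A classifier trained on such features maps them to correctness
    probabilities, and the passing predicates double as human-readable
    explanations of the candidate's behavior."""
    return [sum(row) / len(row) if row else 0.0 for row in matrix]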