Code Generation With Large Language Models: Inductive Reasoning and Calibration

Bibliographic details
Published in: ProQuest Dissertations and Theses (2025)
Main author: Li, Wen-Ding
Published:
ProQuest Dissertations & Theses
Subjects: Computer science; Artificial intelligence; Computer engineering
Links: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3248434448
003 UK-CbPIL
020 |a 9798293822805 
035 |a 3248434448 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Li, Wen-Ding 
245 1 |a Code Generation With Large Language Models: Inductive Reasoning and Calibration 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including code generation. However, complex inductive reasoning (deriving general rules from limited observations) remains a significant challenge. Programming-by-Examples (PBE) aims to synthesize programs from input-output examples and is an important inductive reasoning task in programming languages with practical applications. We propose an approach that enhances LLMs on PBE using code-grounded synthetic data generation, which provides high-quality training data for fine-tuning LLMs and addresses the scarcity of domain-specific data. Furthermore, we demonstrate that scaling test-time computation significantly improves inference results in this PBE setting. Our approach achieves state-of-the-art results on common PBE benchmarks, including the string, number sequence, and logo graphics domains. We further extend our methods to ARC-AGI, a very challenging benchmark requiring visual inductive reasoning from a few examples involving concepts such as physics, objects, and symmetry. By applying our synthetic data and test-time scaling methods and then combining them with transduction, we approach human-level performance on ARC-AGI, demonstrating the framework's effectiveness even in highly challenging, visually grounded domains. Unlike PBE and ARC-AGI tasks, where examples enable direct validation, real-world code generation often begins with ambiguous natural language specifications. This inherent ambiguity creates uncertainty about code correctness. We develop an approach that samples both code and tests from LLMs and uses the execution results to build a classifier that estimates correctness probabilities. The method produces human-interpretable predicates explaining code behavior, a feature that users preferred in a user study, and helps create more trustworthy program synthesis while maintaining state-of-the-art accuracy. 
653 |a Computer science 
653 |a Artificial intelligence 
653 |a Computer engineering 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3248434448/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3248434448/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch
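
A minimal Python sketch of the test-time scaling idea summarized in the abstract above: draw many candidate programs from an LLM and keep only those consistent with every input-output example. The sample_programs call and all other identifiers here are hypothetical placeholders, not the dissertation's actual API.

from typing import List, Tuple

Example = Tuple[object, object]  # (input, expected output)

def run_candidate(src: str, x: object) -> object:
    """Execute a candidate program that defines f(x) and apply it to x."""
    scope: dict = {}
    exec(src, scope)  # the candidate source is expected to define f
    return scope["f"](x)

def filter_by_examples(candidates: List[str], examples: List[Example]) -> List[str]:
    """Keep only candidates whose output matches every input-output example."""
    survivors = []
    for src in candidates:
        try:
            if all(run_candidate(src, x) == y for x, y in examples):
                survivors.append(src)
        except Exception:
            continue  # crashing or non-parsing candidates are discarded
    return survivors

# Test-time scaling: sampling more candidates raises the chance that at least
# one consistent program survives filtering, e.g. (hypothetical LLM call):
# candidates = sample_programs(prompt, n=64)
# consistent = filter_by_examples(candidates, examples)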
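
Similarly, a minimal sketch of the calibration idea from the abstract: sample both candidate programs and test predicates, execute every pair, and derive features (here, just the pass rate) on which a correctness classifier can be trained. The check(f) test convention and the feature choice are illustrative assumptions, not the dissertation's exact design.

from typing import List

def passes(code_src: str, test_src: str) -> bool:
    """True iff the sampled test predicate accepts the candidate program."""
    scope: dict = {}
    try:
        exec(code_src, scope)  # expected to define f
        exec(test_src, scope)  # expected to define check(f) -> bool
        return bool(scope["check"](scope["f"]))
    except Exception:
        return False

def execution_matrix(codes: List[str], tests: List[str]) -> List[List[bool]]:
    """Pass/fail outcome for every (candidate, test) pair."""
    return [[passes(c, t) for t in tests] for c in codes]

def pass_rate_features(matrix: List[List[bool]]) -> List[float]:
    """One simple per-candidate feature: the fraction of sampled tests passed.
    A classifier trained on such features maps them to correctness
    probabilities, and the passing predicates double as human-readable
    explanations of the candidate's behavior."""
    return [sum(row) / len(row) if row else 0.0 for row in matrix]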