From Policy Optimization Foundations to Language Model Post-Training on Structured Tasks

Published: ProQuest Dissertations and Theses (2025)
Author: Liu, Boyi
Publisher: ProQuest Dissertations & Theses
Other Information
Abstract: Reinforcement learning for large models is constrained in three practical ways that this dissertation addresses in sequence. First, we study policy optimization from off-policy data and show how estimating the density ratio (via a learned behavior policy) reduces the variance of importance-weighted objectives. This estimation step is not only central to off-policy bandits; it also underpins PPO/TRPO, whose hybrid update pattern performs multiple policy-improvement steps per batch and is therefore partially off-policy. Second, we establish a theoretical foundation for PPO/TRPO under high-capacity function approximation, proving global convergence with overparameterized neural critics and actors and quantifying the cost of policy evaluation and policy improvement per outer iteration. Third, we move beyond algorithmic foundations to an application in language-model post-training: for structured tasks such as text-to-SQL, we resolve reward scarcity by exploiting task structure to build execution-free reward models, enabling RL at the scale of SFT corpora that lack executable databases.

The second part, Global Convergence of Neural Trust-Region / Proximal Policy Optimization, turns to a theoretical analysis of the most popular online RL algorithm in both classical settings and language-model post-training. We analyze a variant of PPO/TRPO in which both the actor and the critic are overparameterized two-layer neural networks. We show that the algorithm converges to the globally optimal policy at a sublinear rate O(1/√K) in the number of outer policy-improvement iterations, and that each iteration admits polynomial-time policy evaluation and policy improvement: O(1/ε²) TD and SGD steps suffice to keep the approximation errors within the constants of the outer rate. This closes the gap between the practical PPO-style updates used in modern systems and a nonasymptotic convergence guarantee under expressive models.

The third part, Execution-Free RL for Structured Tasks in Language Model Post-Training, tackles the reward-availability problem that arises in RL post-training of LLMs. In current text-to-SQL corpora, the main cost is not running SQL but constructing or curating the databases and test suites needed to execute and compare generated queries; most labeled text-SQL pairs simply do not come with such databases. We introduce a graph-based evaluation metric (FuncEval-GMN) that parses SQL into relational operator trees using only the schema and then predicts functional equivalence with a graph matching network, thereby removing the need to build per-example databases and achieving higher AUC than exact-set or execution-based metrics on Spider and competitive accuracy on WikiSQL/BIRD. Building on this evaluator, we develop Graph-Reward-SQL, an execution-free RL fine-tuning framework that supplies GMN-based outcome rewards and stepwise rewards over CTEs; on Spider and BIRD it consistently outperforms execution-based and LLM-based reward models while cutting inference time and GPU usage, making RL feasible at SFT scale.

Scope. Parts I and II were conducted at Northwestern University, where this author contributed substantively to the problem formulation, methodology, and theoretical analysis. Part III was completed at ByteDance; in that work, this author focused more on project guidance and oversight, including problem scoping, methodological review, and advising on experiment design and implementation.
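
To make the off-policy setup described for the first part concrete, the following is a minimal sketch of an importance-weighted objective in which the unknown behavior policy is replaced by a learned estimate; the notation (π_θ for the target policy, β̂ for the estimated behavior policy, x, a, r for context, action, reward) is illustrative and not taken from the dissertation itself.

% Illustrative sketch only; the symbols below are assumptions, not the
% dissertation's notation.
\[
  \hat{J}(\theta)
  \;=\;
  \frac{1}{n} \sum_{i=1}^{n}
  \underbrace{\frac{\pi_\theta(a_i \mid x_i)}{\hat{\beta}(a_i \mid x_i)}}_{\text{estimated density ratio}}
  \; r_i ,
  \qquad
  \{(x_i, a_i, r_i)\}_{i=1}^{n} \sim \text{unknown behavior policy } \beta .
\]

Estimating β̂ from the logged data, rather than plugging in noisy or missing propensities, is the variance-reduction step the abstract refers to, and the same ratio appears in the clipped surrogate objectives of PPO/TRPO.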
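
To illustrate how an execution-free reward model such as the one described in the third part could plug into an RL fine-tuning loop, the following is a minimal Python sketch under stated assumptions: every name here (parse_to_operator_tree, GraphMatchingScorer, execution_free_reward) is a hypothetical placeholder, not the dissertation's FuncEval-GMN or Graph-Reward-SQL API, and the scorer is a toy stand-in for a trained graph matching network.

# Minimal, illustrative sketch of an execution-free reward for text-to-SQL RL
# fine-tuning. All names are hypothetical placeholders; the real system would
# parse SQL into relational operator trees from the schema alone and score
# functional equivalence with a trained graph matching network (GMN).

def parse_to_operator_tree(sql: str, schema: dict) -> frozenset:
    """Stand-in parser: a real system builds a relational operator tree from
    the SQL text plus the schema, with no database instance required."""
    # Toy normalization so the example runs end to end.
    return frozenset(sql.lower().replace(",", " ").split())


class GraphMatchingScorer:
    """Stand-in for a trained GMN that predicts functional equivalence."""

    def score(self, tree_a: frozenset, tree_b: frozenset) -> float:
        # Toy Jaccard similarity over tokens; a real scorer would run a GMN.
        union = tree_a | tree_b
        return len(tree_a & tree_b) / len(union) if union else 1.0


def execution_free_reward(pred_sql: str, gold_sql: str, schema: dict,
                          scorer: GraphMatchingScorer) -> float:
    """Outcome reward for one generated query, computed without executing SQL."""
    pred_tree = parse_to_operator_tree(pred_sql, schema)
    gold_tree = parse_to_operator_tree(gold_sql, schema)
    return scorer.score(pred_tree, gold_tree)


if __name__ == "__main__":
    schema = {"users": ["id", "name", "age"]}
    scorer = GraphMatchingScorer()
    reward = execution_free_reward(
        "SELECT name FROM users WHERE age > 30",
        "SELECT name FROM users WHERE age > 30",
        schema,
        scorer,
    )
    print(f"reward = {reward:.2f}")  # identical queries -> 1.00

Because the reward depends only on the query text and the schema, it can be computed for every labeled text-SQL pair, which is what allows RL fine-tuning at the scale of SFT corpora that lack executable databases.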
ISBN: 9798270226558
Source: ProQuest Dissertations & Theses Global