Scalable Data Paradigms for Steering General-Purpose Language Models

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2026)
Main Author: Wang, Yizhong
Published: ProQuest Dissertations & Theses
Subjects: Artificial intelligence; Computer engineering; Computer science
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3230023261
003 UK-CbPIL
020 |a 9798288834073 
035 |a 3230023261 
045 2 |b d20260101  |b d20261231 
084 |a 66569  |2 nlm 
100 1 |a Wang, Yizhong 
245 1 |a Scalable Data Paradigms for Steering General-Purpose Language Models 
260 |b ProQuest Dissertations & Theses  |c 2026 
513 |a Dissertation/Thesis 
520 3 |a Pretrained Language Models (LMs) have demonstrated remarkable general-purpose capabilities by encoding vast amounts of knowledge from the internet. However, effectively steering these models to serve diverse downstream applications, such as following instructions, chatting with users, using tools, or performing complex reasoning, poses another set of challenges that require diverse, high-quality, and increasingly costly training data. This dissertation explores scalable paradigms for structuring, creating, and optimizing data to facilitate the broader generalization of language models and enhance their critical capabilities. First, through the creation of the Super-NaturalInstructions benchmark—a large-scale dataset with over 1,600 NLP tasks—I demonstrate that unifying NLP tasks via natural language instructions enables model generalization at the task level. Second, I propose Self-Instruct, a novel framework where LMs generate their own instructional data to train themselves, thereby demonstrating model self-improvement. Third, I develop HyPER, a framework that routes preference annotation tasks between humans and AI to optimize data quality and collection efficiency for preference-based learning. Fourth, I systematically study the impact of diverse open instruction-tuning datasets on LM capabilities, leading to the development of the Tülu series of openly available and highly capable models. Together, these efforts—unifying task structures, leveraging model-generated synthetic data, optimizing human-AI data partnerships, and fostering open data ecosystems—have demonstrated an effective path to building a strong, scalable, and community-driven data foundation for post-training language models. Finally, I envision future directions that can further enhance this data foundation for building more advanced and sustainable AI systems. 
653 |a Artificial intelligence 
653 |a Computer engineering 
653 |a Computer science 
773 0 |t ProQuest Dissertations and Theses  |g (2026) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3230023261/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3230023261/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch