Scalable Data Paradigms for Steering General-Purpose Language Models

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2026)
Main Author: Wang, Yizhong
Published: ProQuest Dissertations & Theses
Subjects: Artificial intelligence; Computer engineering; Computer science
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3230023261
003 UK-CbPIL
020 |a 9798288834073 
035 |a 3230023261 
045 2 |b d20260101  |b d20261231 
084 |a 66569  |2 nlm 
100 1 |a Wang, Yizhong 
245 1 |a Scalable Data Paradigms for Steering General-Purpose Language Models 
260 |b ProQuest Dissertations & Theses  |c 2026 
513 |a Dissertation/Thesis 
520 3 |a Pretrained Language Models (LMs) have demonstrated remarkable general-purpose capabilities by encoding vast amounts of knowledge from the internet. However, effectively steering these models to serve diverse downstream applications, such as following instructions, chatting with users, using tools, or performing complex reasoning, poses another set of challenges that require diverse, high-quality, and increasingly costly training data. This dissertation explores scalable paradigms for structuring, creating, and optimizing data to facilitate the broader generalization of language models and enhance their critical capabilities. First, through the creation of the Super-NaturalInstructions benchmark—a large-scale dataset with over 1,600 NLP tasks—I demonstrate that unifying NLP tasks via natural language instructions enables model generalization at the task level. Second, I propose Self-Instruct, a novel framework where LMs generate their own instructional data to train themselves, thereby demonstrating model self-improvement. Third, I develop HyPER, a framework that routes preference annotation tasks between humans and AI to optimize data quality and collection efficiency for preference-based learning. Fourth, I systematically study the impact of diverse open instruction-tuning datasets on LM capabilities, leading to the development of the Tülu series of openly available and highly capable models. Together, these efforts—unifying task structures, leveraging model-generated synthetic data, optimizing human-AI data partnerships, and fostering open data ecosystems—have demonstrated an effective path to building a strong, scalable, and community-driven data foundation for post-training language models. Finally, I envision future directions that can further enhance this data foundation for building more advanced and sustainable AI systems. 
653 |a Artificial intelligence 
653 |a Computer engineering 
653 |a Computer science 
773 0 |t ProQuest Dissertations and Theses  |g (2026) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3230023261/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3230023261/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch