Scalable Data Paradigms for Steering General-Purpose Language Models

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2026)
Main Author: Wang, Yizhong
Published:
ProQuest Dissertations & Theses
Subjects:
Online Access: Citation/Abstract
Full Text - PDF
Description
Abstract: Pretrained Language Models (LMs) have demonstrated remarkable general-purpose capabilities by encoding vast amounts of knowledge from the internet. However, effectively steering these models to serve diverse downstream applications, such as following instructions, chatting with users, using tools, or performing complex reasoning, poses another set of challenges that require diverse, high-quality, and increasingly costly training data.

This dissertation explores scalable paradigms for structuring, creating, and optimizing data to facilitate the broader generalization of language models and enhance their critical capabilities. First, through the creation of the Super-NaturalInstructions benchmark, a large-scale dataset with over 1,600 NLP tasks, I demonstrate that unifying NLP tasks via natural language instructions enables model generalization at the task level. Second, I propose Self-Instruct, a novel framework in which LMs generate their own instructional data to train themselves, thereby demonstrating model self-improvement. Third, I develop HyPER, a framework that routes preference annotation tasks between humans and AI to optimize data quality and collection efficiency for preference-based learning. Fourth, I systematically study the impact of diverse open instruction-tuning datasets on LM capabilities, leading to the development of the Tülu series of openly available and highly capable models.

Together, these efforts (unifying task structures, leveraging model-generated synthetic data, optimizing human-AI data partnerships, and fostering open data ecosystems) demonstrate an effective path to building a strong, scalable, and community-driven data foundation for post-training language models. Finally, I envision future directions that can further enhance this data foundation for building more advanced and sustainable AI systems.
ISBN:9798288834073
Source: ProQuest Dissertations & Theses Global