Algorithmic Data Efficient Learning in the Era of Large Model

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Chen, Yifang
Published: ProQuest Dissertations & Theses
Online Access: Citation/Abstract
Full Text - PDF
Description
Abstract: In the race toward Artificial General Intelligence, data is the fuel that powers our most advanced models. Vision-Language Models such as LLaVA and CLIP are trained on billions of image-text pairs, while Large Language Models (LLMs) such as GPT and Claude may process trillions of text samples. Despite this abundance of data, ensuring its quality and curating it effectively remain more of an art than a science. Curation must handle real-world data that is multimodal, noisy, and not guaranteed to be relevant to target tasks. The problem is further compounded by the complex training dynamics of neural networks, where the value of each data point depends heavily on the evolving state of model training. Without principled guidance, these challenges often create systematic blind spots, and their impact remains unclear due to a lack of theoretical understanding. My research aims to develop theoretical foundations for data curation by designing theory-inspired algorithms under realistic assumptions and establishing systematic empirical evaluation frameworks to understand the limitations of existing methods, including: (1) target-aware data curation in pretraining; (2) label-efficient finetuning; (3) inference-efficient data synthesis; and (4) interactive learning theories.
ISBN: 9798293849888
Source: ProQuest Dissertations & Theses Global