Exploiting Data, Task, and Model Structure for Supervision-Efficient Natural Language Processing

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Wang, Zihan
Published:
ProQuest Dissertations & Theses
Subjects:
Online Access: Citation/Abstract
Full Text - PDF
Description
Summary: Effective natural language processing systems typically require extensive human annotations, creating a major bottleneck for deploying models on new tasks. This thesis develops methods that reduce the dependence on human supervision by exploiting the inherent structure of the data, the task, and the language models themselves.

First, we present X-Class, which performs text classification using only class names by exploiting corpus-level distributional structure. Rather than requiring labeled examples, the method learns adaptive document representations that align with the given classes through clustering, allowing the corpus itself to provide the supervisory signal. Specifically, X-Class estimates class representations by incrementally adding similar words, obtains document representations via class-attention mechanisms, and trains classifiers on confidently pseudo-labeled documents. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on seven benchmark datasets.

Second, we introduce Goal-Driven Explainable Clustering (GoalEx), which exploits task structure by decomposing clustering into a propose-assign-select pipeline: language models generate candidate cluster explanations conditioned on user goals, and an optimization step selects the subset that best covers the corpus. This task decomposition naturally produces interpretable outputs: each cluster comes with a human-readable explanation of what it represents. Under both automatic and human evaluation, our method produces more accurate and goal-related explanations than prior methods.

Third, we present FFF-NER, a few-shot fine-tuning framework for Named Entity Recognition that exploits task structure by aligning fine-tuning with pre-training objectives. We hypothesize that fine-tuning performance improves when the fine-tuning task resembles the pre-training task. By decoupling span detection from type prediction and formulating NER as masked token prediction, our method achieves state-of-the-art few-shot NER performance on ten benchmark datasets.

Fourth, we present Model-induced Process Supervision (MiPS), which exploits the structure of language model reasoning itself. By sampling how well partial solutions can be completed, the method automatically generates training labels for verifying multi-step reasoning, removing the need for expensive step-by-step human annotation. Our approach significantly improves performance on math and coding tasks compared to output-supervised verifiers.

Together, these works establish that systematic exploitation of structure, whether in data distributions, task formulations, or model behaviors, can effectively replace or augment human supervision, enabling scalable and interpretable NLP systems.
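As an illustration of the process-supervision idea summarized above, the sketch below shows how step-level training labels could be derived by sampling completions of partial solutions. It is a minimal sketch only: the helper names (sample_completions, is_correct), the success-rate threshold, and all other details are assumptions for illustration, not the procedure reported in the thesis.

# Minimal sketch: label each reasoning prefix by how often sampled
# completions from that prefix reach a correct final answer.
# `sample_completions` and `is_correct` are hypothetical placeholders.

from typing import Callable, List, Tuple


def label_reasoning_steps(
    problem: str,
    solution_steps: List[str],
    sample_completions: Callable[[str, int], List[str]],
    is_correct: Callable[[str], bool],
    num_samples: int = 8,
    threshold: float = 0.5,
) -> List[Tuple[str, int]]:
    """Assign a pseudo-label to each reasoning prefix.

    A prefix is labeled 1 if at least `threshold` of its sampled
    completions lead to a correct answer, else 0. Such labels could
    train a step-level verifier without human step annotations.
    """
    labels = []
    prefix = problem
    for step in solution_steps:
        prefix = prefix + "\n" + step          # extend the partial solution
        completions = sample_completions(prefix, num_samples)
        success_rate = sum(is_correct(c) for c in completions) / max(num_samples, 1)
        labels.append((step, int(success_rate >= threshold)))
    return labels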
ISBN: 9798270246198
Source: ProQuest Dissertations & Theses Global