Exploiting Data, Task, and Model Structure for Supervision-Efficient Natural Language Processing

Bibliographic Details
Published in: ProQuest Dissertations and Theses (2025)
Main Author: Wang, Zihan
Published: ProQuest Dissertations & Theses
Subjects: Computer science; Computer engineering; Information science
Online Access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3285841983
003 UK-CbPIL
020 |a 9798270246198 
035 |a 3285841983 
045 2 |b d20250101  |b d20251231 
084 |a 66569  |2 nlm 
100 1 |a Wang, Zihan 
245 1 |a Exploiting Data, Task, and Model Structure for Supervision-Efficient Natural Language Processing 
260 |b ProQuest Dissertations & Theses  |c 2025 
513 |a Dissertation/Thesis 
520 3 |a Effective natural language processing systems typically require extensive human annotations, creating a major bottleneck for deploying models on new tasks. This thesis develops methods that reduce the dependence on human supervision by exploiting the inherent structure of the data, task, and language models themselves. First, we present X-Class, which performs text classification using only class names by exploiting corpus-level distributional structure. Rather than requiring labeled examples, the method learns adaptive document representations that align with the given classes through clustering, allowing the corpus itself to provide the supervisory signal. Specifically, X-Class estimates class representations by incrementally adding similar words, obtains document representations via class-attention mechanisms, and trains classifiers on confidently pseudo-labeled documents. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on seven benchmark datasets. Second, we introduce Goal-Driven Explainable Clustering (GoalEx), which exploits task structure by decomposing clustering into a propose-assign-select pipeline: language models generate candidate cluster explanations conditioned on user goals, and optimization selects the subset that best covers the corpus. This task decomposition naturally produces interpretable outputs: each cluster comes with a human-readable explanation of what it represents. Under both automatic and human evaluation, our method produces more accurate and goal-related explanations than prior methods. Third, we present FFF-NER, a few-shot fine-tuning framework for Named Entity Recognition that exploits task structure by aligning fine-tuning with pre-training objectives. We hypothesize that fine-tuning performance improves when the fine-tuning task resembles the pre-training task. By decoupling span detection from type prediction and formulating NER as masked token prediction, our method achieves state-of-the-art few-shot NER performance on ten benchmark datasets. Fourth, we present Model-induced Process Supervision (MiPS), which exploits the structure of language model reasoning itself. By sampling how well partial solutions can be completed, the method automatically generates training labels for verifying multi-step reasoning, removing the need for expensive step-by-step human annotation. Our approach significantly improves performance on math and coding tasks compared to output-supervised verifiers. Together, these works establish that systematic exploitation of structure, whether in data distributions, task formulations, or model behaviors, can effectively replace or augment human supervision, enabling scalable and interpretable NLP systems. 
653 |a Computer science 
653 |a Computer engineering 
653 |a Information science 
773 0 |t ProQuest Dissertations and Theses  |g (2025) 
786 0 |d ProQuest  |t ProQuest Dissertations & Theses Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3285841983/abstract/embedded/09EF48XIB41FVQI7?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3285841983/fulltextPDF/embedded/09EF48XIB41FVQI7?source=fedsrch