Addressing Data Leakage and Imbalance for Robust Fine-Tuning of Pre-Trained Code Language Models in Program Repair and Vulnerability Detection
| Published in: | ProQuest Dissertations and Theses (2025) |
|---|---|
| Main author: | |
| Published: | ProQuest Dissertations & Theses |
| Subjects: | |
| Online access: | Citation/Abstract; Full Text - PDF |

Abstract:

Large Language Models (LLMs) hold significant promise for advancing automated vulnerability detection and program repair, yet their effectiveness is often constrained by fundamental issues in dataset construction and learning under severe class imbalance. In this work, we investigate two critical and underexplored challenges that limit real-world applicability.

First, in the domain of program repair, we show that the widely used VulRepair dataset suffers from pervasive duplication and leakage, incomplete samples, and inaccurate labels, flaws inherited by the numerous papers built upon it. To assess the impact of these deficiencies on model performance, we produced several deduplicated versions of the dataset, systematically removing duplicates from training while retaining them in testing, or vice versa, and evaluated the models on each variant. Our empirical analysis reveals that previous studies have substantially overestimated performance: the VulRepair model, previously reported to achieve 44% accuracy, yields only 9–13% accuracy once dataset leakage is removed. Through extensive label verification across the ten most hazardous Common Weakness Enumerations (CWEs), we find that 56% of samples have incorrect labels and 44% are incomplete (only 31% are both accurate and complete), underscoring the urgent need for rigorous dataset curation. To address these performance limitations, we further employ transfer learning with a large deduplicated bug-fix corpus, demonstrating that LLM-based repair models can achieve improved results when trained on high-quality data.

Second, in the context of vulnerability detection, we examine the impact of extreme class imbalance using PrimeVul, a realistic large-scale benchmark. We conduct a systematic evaluation of advanced sampling techniques and a specialized loss function, Focal Loss, to improve recall, reduce false negative rates, and optimize performance in low-false-positive regions. Our results demonstrate that combining targeted sampling with a robust loss function yields gains in detection performance, lowering both the false negative rate (FNR) and the false positive rate (FPR) compared to conventional training approaches.

Taken together, these findings highlight that, while previous evaluations often relied on flawed assumptions or inadequate data quality, LLM-based methods remain promising when paired with principled dataset engineering and learning strategies tailored to the unique challenges of vulnerability detection and repair.

| ISBN: | 9798297615373 |
|---|---|
| Source: | ProQuest Dissertations & Theses Global |
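
The deduplication methodology the abstract describes, building dataset variants where duplicates are removed from training but retained in testing (or the reverse), can be illustrated with a short sketch. This is an editorial illustration rather than code from the dissertation; the sample format and function names below are assumptions.

```python
import hashlib

def fingerprint(sample: dict) -> str:
    """Hash the whitespace-normalized (source, target) pair so exact
    duplicates collide regardless of formatting differences."""
    src = " ".join(sample["source"].split())
    tgt = " ".join(sample["target"].split())
    return hashlib.sha256(f"{src}\x00{tgt}".encode("utf-8")).hexdigest()

def remove_train_test_leakage(train: list[dict], test: list[dict]) -> list[dict]:
    """Return a training set with every sample dropped whose fingerprint
    also appears in the test set: the 'deduplicate training, retain
    testing' variant. Swapping the arguments yields the opposite variant."""
    test_fps = {fingerprint(s) for s in test}
    return [s for s in train if fingerprint(s) not in test_fps]
```

Evaluating the same model on each variant separates genuine generalization from memorized train-test overlap, which is how the drop from 44% to 9–13% accuracy is surfaced.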
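Focal Loss, the specialized loss the abstract pairs with targeted sampling on PrimeVul, down-weights easy examples so gradients concentrate on hard, minority-class samples. A minimal PyTorch sketch of the standard formulation (Lin et al., 2017) follows; the dissertation's exact hyperparameters are not stated, so the alpha and gamma values here are the common defaults.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    The (1 - p_t)**gamma factor shrinks the loss on well-classified
    examples, which dominate under extreme class imbalance."""
    # Standard BCE gives -log(p_t) per example.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

Setting gamma to 0 recovers ordinary class-weighted cross-entropy, so the focusing strength can be tuned against the low-false-positive objective the abstract targets.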