Exploring RAG-based Vulnerability Augmentation with LLMs

Bibliographic Details
Published in: arXiv.org (Dec 5, 2024), p. n/a
Main author: Seyed Shayan Daneshvar
Other authors: Yu, Nong; Xu, Yang; Wang, Shaowei; Cai, Haipeng
Publication:
Cornell University Library, arXiv.org
Subjects: Software reliability; Computer program integrity; Datasets; Data augmentation; Vulnerability; Large language models; Machine learning; Clustering; Shortages
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3091015001
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3091015001 
045 0 |b d20241205 
100 1 |a Seyed Shayan Daneshvar 
245 1 |a Exploring RAG-based Vulnerability Augmentation with LLMs 
260 |b Cornell University Library, arXiv.org  |c Dec 5, 2024 
513 |a Working Paper 
520 3 |a Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) face a data shortage, which limits their effectiveness. Data augmentation can potentially alleviate the data shortage, but augmenting vulnerable code is challenging and requires a generative solution that maintains vulnerability. Previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval-augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies to augment both single- and multi-statement vulnerabilities with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across three vulnerability datasets and DLVD models, using two LLMs, shows that our approach beats two SOTA methods, Vulgen and VGX, and Random Oversampling (ROS) by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples on average, and by 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates the feasibility of large-scale data augmentation by generating 1K samples for as little as US$ 1.88. 
653 |a Software reliability 
653 |a Computer program integrity 
653 |a Datasets 
653 |a Data augmentation 
653 |a Vulnerability 
653 |a Large language models 
653 |a Machine learning 
653 |a Clustering 
653 |a Shortages 
700 1 |a Yu, Nong 
700 1 |a Xu, Yang 
700 1 |a Wang, Shaowei 
700 1 |a Cai, Haipeng 
773 0 |t arXiv.org  |g (Dec 5, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3091015001/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2408.04125
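
The abstract above names three RAG-based augmentation strategies (Mutation, Injection, and Extension). As an illustration only, the following Python sketch shows how an Injection-style prompt could be assembled by retrieving a lexically similar vulnerable sample and pairing it with a clean function; the function names, the similarity measure, and the prompt wording are assumptions for this sketch, not VulScribeR's actual templates or pipeline.

# Illustrative sketch of the "Injection" idea described in the abstract:
# retrieve a similar known-vulnerable sample, then ask an LLM to inject a
# comparable vulnerability into a clean function. Names and prompt text are
# assumptions, not the paper's actual templates.
from difflib import SequenceMatcher


def retrieve_similar_vulnerable(clean_code: str, vulnerable_pool: list[str], k: int = 1) -> list[str]:
    """Rank known vulnerable samples by lexical similarity to the clean code."""
    ranked = sorted(
        vulnerable_pool,
        key=lambda vuln: SequenceMatcher(None, clean_code, vuln).ratio(),
        reverse=True,
    )
    return ranked[:k]


def build_injection_prompt(clean_code: str, retrieved: list[str]) -> str:
    """Assemble a RAG-style prompt pairing retrieved examples with the clean function."""
    examples = "\n\n".join(f"/* retrieved vulnerable example */\n{v}" for v in retrieved)
    return (
        "You are given a clean C function and vulnerable examples retrieved "
        "from a dataset. Inject a comparable vulnerability into the clean "
        "function while keeping it compilable, and return only the code.\n\n"
        f"{examples}\n\n/* clean function to augment */\n{clean_code}\n"
    )


if __name__ == "__main__":
    pool = ["void f(char *s) { char buf[8]; strcpy(buf, s); }"]
    clean = "void g(const char *msg) { printf(\"%s\\n\", msg); }"
    prompt = build_injection_prompt(clean, retrieve_similar_vulnerable(clean, pool))
    print(prompt)  # this prompt would then be sent to an LLM (call not shown)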