Towards Intelligent Reliable Code Retrieval Based on Code Semantics Learning

Bibliographic Details
Published in: PQDT - Global (2024)
Main author: Gu, Wenchao
Published: ProQuest Dissertations & Theses
Online access: Citation/Abstract; Full Text - PDF
Description
Abstract: With the large-scale application of software across industries, the demand for software development has snowballed in recent decades. Code retrieval, which returns the code snippets a user wants from a code database given a natural language description, can significantly reduce the workload of software developers, and is therefore an important research topic. However, the retrieved code snippets may contain vulnerabilities and cannot always be used directly, so vulnerability detection for the retrieved snippets is necessary. In this thesis, we present our exploration of code retrieval and software vulnerability detection. Specifically, we address several common challenges in effective code retrieval, code retrieval acceleration, and software vulnerability detection in the following four parts.
Firstly, we study the problem of code retrieval. Considering the highly structured nature of source code, we propose a novel neural network model named CRADLe. CRADLe couples the structural and semantic information of code at the statement level, where the code structures are extracted from the program dependency graph. The evaluation results show that CRADLe significantly outperforms the state-of-the-art baseline models.
Secondly, we shift to the problem of code retrieval efficiency. Current deep learning-based approaches need to rank all the source code snippets in the corpus during searching, which incurs a large computational cost. To address this problem, we propose a novel approach named CoSHC. CoSHC clusters the representation vectors into different categories and generates binary hash codes for both source code and queries. During retrieval, CoSHC retrieves a different number of code candidates from each category for the given query. The evaluation results show that CoSHC preserves most of the performance of the original models while significantly reducing the retrieval time.
Thirdly, we focus on how to further improve code retrieval efficiency. Although calculating the Hamming distance is very efficient, Hamming distance-based methods have to scan the whole database, which leads to a considerable computation cost. To address this problem, we propose CSSDH, a hash table-based code retrieval framework that replaces the Hamming distance calculation with hash table lookups. Experimental results indicate that CSSDH significantly reduces the retrieval time of current state-of-the-art deep hashing approaches while retaining comparable performance, or even outperforming them, in the recall step.
Fourthly, we shift to the problem of vulnerability detection. Previous deep learning-based approaches have struggled to achieve accurate vulnerability localization, as they do not prioritize the localization problem during training. Predicting statement-level vulnerabilities automatically in a supervised manner is difficult, because it requires statement-level labeled data for model learning. To address this issue, we propose a novel approach named WILDE for function-level vulnerability detection with statement-level localization. WILDE achieves statement-level vulnerability localization without statement-level labeled data. Extensive experiments show that WILDE achieves function-level vulnerability detection performance comparable to that of previous models, and that its ability to localize vulnerabilities surpasses theirs.
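As a rough illustration of the efficiency argument in the abstract (not the thesis's actual CoSHC or CSSDH implementation), the sketch below contrasts a full Hamming-distance scan over binary hash codes with a bucket lookup in a hash table. The 8-bit codes, the function names, and the probing scheme (exact bucket plus all buckets one bit-flip away) are assumptions made only for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def scan_search(query, codes, radius):
    """Baseline: compare the query with every code in the database."""
    return [i for i, c in enumerate(codes) if hamming(query, c) <= radius]


def build_table(codes):
    """Group snippet indices by their exact hash code (one bucket per code)."""
    table = defaultdict(list)
    for i, c in enumerate(codes):
        table[c].append(i)
    return table


def table_search(query, table, n_bits):
    """Probe the query's own bucket and every bucket one bit-flip away.

    This is a hypothetical probing scheme chosen for illustration; it
    covers the same candidates as a Hamming scan with radius 1.
    """
    probes = [query] + [query ^ (1 << b) for b in range(n_bits)]
    hits = []
    for p in probes:
        hits.extend(table.get(p, []))
    return sorted(hits)


# Toy database of 8-bit hash codes (made-up values).
codes = [0b10110010, 0b10110011, 0b01001100, 0b10110010]
table = build_table(codes)
query = 0b10110010

scan_hits = scan_search(query, codes, radius=1)
table_hits = table_search(query, table, n_bits=8)
```

In this toy setting both searches return the same candidates, but the table version inspects only `n_bits + 1` buckets regardless of how many snippets are stored, whereas the scan touches every code in the database.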
ISBN:9798304977500
Source: ProQuest Dissertations & Theses Global