Quantifying cross-language code reuse via function-level clone detection
| Published in: | Journal of King Saud University. Computer and Information Sciences vol. 37, no. 10 (Dec 2025), p. 327 |
|---|---|
| Main author: | |
| Other authors: | |
| Published: | Springer Nature B.V. |
| Subjects: | |
| Abstract: | Code reuse through cloning is common in software development, yet excessive or unchecked cloning can harm maintainability and raise plagiarism concerns. Detecting the proportion of reused (cloned) code in a software project, especially across different programming languages, is a challenging task. This paper defines code reuse proportion detection as measuring how much code in a target program is cloned (identical or similar) from elsewhere. Existing code clone detection techniques perform well in single-language settings but struggle with cross-language clones and do not directly quantify reuse proportion. To address these gaps, we propose a novel cross-language function-level code clone detection approach using a dual embedding Siamese neural network. Our method represents code in Java and Python using a unified abstract syntax structure and semantic embeddings, then uses a Siamese deep network to learn language-agnostic similarities. We also introduce a metric to quantify the clone-based reuse ratio for each function or program. Experiments on three public datasets (including a Java clone benchmark, a Python code clone corpus, and a cross-language Java–Python clone dataset) show that our approach outperforms ten baseline methods, including state-of-the-art and classical clone detectors. Ablation studies confirm the contribution of each component (structural embeddings, cross-language alignment, and contrastive learning) to performance gains. Our model achieves new state-of-the-art accuracy in code clone detection, enabling precise measurement of code reuse. These results demonstrate that the proposed approach can effectively detect cross-language code clones and quantify reuse proportion, benefiting software plagiarism detection and code quality assessment in multi-language projects. |
| ISSN: | 1319-1578 |
| DOI: | 10.1007/s44443-025-00362-2 |
| Source: | Computer Science Database |
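
The reuse proportion described in the abstract (how much of a target program is cloned from elsewhere) can be illustrated with a minimal sketch. It is not the paper's metric or model: it assumes each function has already been mapped to an embedding vector by some language-agnostic encoder, and it counts a function as reused when its best cosine similarity against a reference corpus reaches a threshold. The names `cosine` and `reuse_ratio`, the threshold of 0.8, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def reuse_ratio(target_funcs, corpus_funcs, threshold=0.8):
    """Fraction of target functions with at least one corpus match at or above threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not target_funcs:
        return 0.0
    cloned = sum(
        1
        for t in target_funcs
        if any(cosine(t, c) >= threshold for c in corpus_funcs)
    )
    return cloned / len(target_funcs)


# Toy 2-D "embeddings" standing in for the output of a real encoder.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
corpus = [[2.0, 0.0], [1.0, 1.0]]

# Two of the four target functions have a close match in the corpus.
print(reuse_ratio(target, corpus))  # 0.5
```

A function-level ratio like this treats every function equally; weighting by function length would be one plausible variant when the proportion of cloned code, not cloned functions, is of interest.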