Quantifying cross-language code reuse via function-level clone detection

Bibliographic details
Published in:	Journal of King Saud University. Computer and Information Sciences vol. 37, no. 10 (Dec 2025), p. 327
Main author:	Rong, Yi
Other authors:	Zhou, Yan
Published:	Springer Nature B.V.
Subjects:	Language; Maintainability; Software; Datasets; Quality assessment; Java; Deep learning; Neural networks; Syntax; Artificial neural networks; Cloning; Programming languages; Sensors; Ablation; Python; Plagiarism; Software reuse; Code reuse; Software development; Semantics

Description
Abstract:	Code reuse through cloning is common in software development, yet excessive or unchecked cloning can harm maintainability and raise plagiarism concerns. Detecting the proportion of reused (cloned) code in a software project, especially across different programming languages, is a challenging task. This paper defines code reuse proportion detection as measuring how much code in a target program is cloned (identical or similar) from elsewhere. Existing code clone detection techniques perform well in single-language settings but struggle with cross-language clones and do not directly quantify reuse proportion. To address these gaps, we propose a novel cross-language function-level code clone detection approach using a dual embedding Siamese neural network. Our method represents code in Java and Python using a unified abstract syntax structure and semantic embeddings, then uses a Siamese deep network to learn language-agnostic similarities. We also introduce a metric to quantify the clone-based reuse ratio for each function or program. Experiments on three public datasets (including a Java clone benchmark, a Python code clone corpus, and a cross-language Java–Python clone dataset) show that our approach outperforms ten baseline methods, including state-of-the-art and classical clone detectors. Ablation studies confirm the contribution of each component (structural embeddings, cross-language alignment, and contrastive learning) to performance gains. Our model achieves new state-of-the-art accuracy in code clone detection, enabling precise measurement of code reuse. These results demonstrate that the proposed approach can effectively detect cross-language code clones and quantify reuse proportion, benefiting software plagiarism detection and code quality assessment in multi-language projects.
ISSN:	1319-1578
DOI:	10.1007/s44443-025-00362-2
Source:	Computer Science Database
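
Illustrative note: the abstract mentions a metric for the clone-based reuse ratio but does not give its formula. The sketch below shows one plausible reading, under stated assumptions: each function in the target program already has an embedding vector (for example, from some Siamese-style encoder), a function counts as cloned when its cosine similarity to any reference function meets a threshold, and the ratio is weighted by lines of code. The names `reuse_ratio`, `cosine`, the line-count weighting, and the 0.9 threshold are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reuse_ratio(target, corpus, threshold=0.9):
    """Fraction of target lines that belong to functions flagged as clones.

    target:    list of (embedding, lines_of_code) pairs, one per function
    corpus:    list of reference embeddings to compare against
    threshold: assumed similarity cut-off for calling a function a clone
    """
    total = sum(loc for _, loc in target)
    if total == 0:
        return 0.0
    cloned = sum(
        loc
        for emb, loc in target
        if any(cosine(emb, ref) >= threshold for ref in corpus)
    )
    return cloned / total


# Toy example: two target functions, one reference function.
target = [([1.0, 0.0], 10), ([0.0, 1.0], 30)]
corpus = [[0.9, 0.1]]
ratio = reuse_ratio(target, corpus)  # only the 10-line function matches
```

In this toy input only the first function is close enough to the reference, so 10 of 40 lines count as reused. A per-function variant would simply report the best similarity score for each function instead of aggregating.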