A Machine Learning-Based Approach For Detecting Malicious PyPI Packages

Bibliographic Details
Published in:	arXiv.org (Dec 6, 2024), p. n/a
Main Author:	Samaana, Haya
Other Authors:	Diego Elias Costa; Shihab, Emad; Abdellatif, Ahmad
Published:	Cornell University Library, arXiv.org
Subjects:	Machine learning; Packages; Software development; Static code analysis; Ensemble learning; Computer programming
Online Access:	Citation/Abstract; Full text outside of ProQuest

MARC


LEADER	00000nab a2200000uu 4500
001	3142374147
003	UK-CbPIL
022			\|a 2331-8422
035			\|a 3142374147
045	0		\|b d20241206
100	1		\|a Samaana, Haya
245	1		\|a A Machine Learning-Based Approach For Detecting Malicious PyPI Packages
260			\|b Cornell University Library, arXiv.org \|c Dec 6, 2024
513			\|a Working Paper
520	3		\|a Background. In modern software development, the use of external libraries and packages is increasingly prevalent, streamlining the software development process and enabling developers to deploy feature-rich systems with little coding. While this reliance on reusing code offers substantial benefits, it also introduces serious risks for deployed software in the form of malicious packages - harmful and vulnerable code disguised as useful libraries. Aims. Popular ecosystems, such as PyPI, receive thousands of new package contributions every week, and distinguishing safe contributions from harmful ones presents a significant challenge. There is a dire need for reliable methods to detect and address the presence of malicious packages in these environments. Method. To address these challenges, we propose a data-driven approach that uses machine learning and static analysis to examine the package's metadata, code, files, and textual characteristics to identify malicious packages. Results. In evaluations conducted within the PyPI ecosystem, we achieved an F1-measure of 0.94 for identifying malicious packages using a stacking ensemble classifier. Conclusions. This tool can be seamlessly integrated into package vetting pipelines and has the capability to flag entire packages, not just malicious function calls. This enhancement strengthens security measures and reduces the manual workload for developers and registry maintainers, thereby contributing to the overall integrity of the ecosystem.
653			\|a Machine learning
653			\|a Packages
653			\|a Software development
653			\|a Static code analysis
653			\|a Ensemble learning
653			\|a Computer programming
700	1		\|a Diego Elias Costa
700	1		\|a Shihab, Emad
700	1		\|a Abdellatif, Ahmad
773	0		\|t arXiv.org \|g (Dec 6, 2024), p. n/a
786	0		\|d ProQuest \|t Engineering Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3142374147/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch
856	4	0	\|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2412.05259
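
Illustrative note (not part of the catalogued record): the abstract describes using static analysis of package code as input to a machine learning classifier, but does not list the features used. The sketch below shows, under stated assumptions, what static feature extraction can look like with Python's standard `ast` module. The function names, the feature names, and the `SUSPICIOUS_CALLS` set are all hypothetical and are not taken from the paper.

```python
import ast

# Hypothetical set of call names treated as suspicious; NOT taken from the paper.
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "system", "popen", "b64decode"}


def call_name(node):
    """Return the trailing name of a call target, e.g. 'system' for os.system(...)."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def extract_features(source):
    """Parse source text without executing it and return simple numeric features."""
    tree = ast.parse(source)
    features = {"n_imports": 0, "n_calls": 0, "n_suspicious_calls": 0}
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            features["n_imports"] += 1
        elif isinstance(node, ast.Call):
            features["n_calls"] += 1
            if call_name(node) in SUSPICIOUS_CALLS:
                features["n_suspicious_calls"] += 1
    return features


# The sample is only parsed, never run, so undefined names such as `payload` are fine.
sample = "import os, base64\nexec(base64.b64decode(payload))\nos.system('id')\n"
print(extract_features(sample))
```

Feature dictionaries of this kind could be turned into vectors and passed to a classifier; the stacking ensemble reported in the abstract would sit downstream of such a step, and its configuration is not specified in this record.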