Machine Learning for Malicious URL Classification with Expanded Feature Selection and Natural Language Processing: A Temporal Analysis

Bibliographic details
Published in:	ProQuest Dissertations and Theses (2025)
Main author:	Perry, Van
Published in:	ProQuest Dissertations & Theses
Subjects:	Artificial intelligence Computer engineering Information science
Online access:	Citation/Abstract Full Text - PDF

MARC


LEADER	00000nab a2200000uu 4500
001	3187648098
003	UK-CbPIL
020			\|a 9798310350151
035			\|a 3187648098
045	2		\|b d20250101 \|b d20251231
084			\|a 66569 \|2 nlm
100	1		\|a Perry, Van
245	1		\|a Machine Learning for Malicious URL Classification with Expanded Feature Selection and Natural Language Processing: A Temporal Analysis
260			\|b ProQuest Dissertations & Theses \|c 2025
513			\|a Dissertation/Thesis
520	3		\|a This praxis further investigates the research performed by Evan Wehr (2024), who argued that URLs change over time and that, when Machine Learning (ML) is applied to malicious URL classification, performance should therefore decay over time. Wehr’s (2024) research does not include the use of natural language processing for malicious URL classification; this praxis extends it in that direction. Traditional approaches to ML model training and testing assume static datasets, neglecting the temporal dynamics inherent in URLs. By addressing this gap, the aim is to determine the effectiveness of incorporating natural language-based features in enhancing model performance and resilience to concept drift over time. The research performed demonstrates the potential improvements or shortcomings of including natural language processing in a temporal analysis over existing feature selections. To test the hypotheses, a dataset comprising 2,292,882 URLs, one of the largest in this domain, was used. The temporal analysis revealed the presence of concept drift and indicated potential performance decay. Models resistant to such decay, such as XGB, LR, and NB with normalization and standardization, exhibited strong lasting power. This study underscores the importance of considering temporal dynamics and feature selection in designing robust ML solutions for malicious URL classification, providing valuable insights for security engineers to make informed decisions in safeguarding against evolving threats.
653			\|a Artificial intelligence
653			\|a Computer engineering
653			\|a Information science
773	0		\|t ProQuest Dissertations and Theses \|g (2025)
786	0		\|d ProQuest \|t Publicly Available Content Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3187648098/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch
856	4	0	\|3 Full Text - PDF \|u https://www.proquest.com/docview/3187648098/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch