AI driven web crawling for semantic extraction of news content from newspapers
| Published in: | Scientific Reports (Nature Publisher Group) vol. 15, no. 1 (2025), p. 41673-41692 |
|---|---|
| Main author: | |
| Other authors: | |
| Published: | Nature Publishing Group |
| Subjects: | |
| Online access: | Citation/Abstract, Full Text, Full Text - PDF |
| Abstract: | Efficient data extraction from the ever-expanding web, including structured and unstructured sources such as newspaper databases, is critical for industries like media, research, and journalism. Traditional web crawlers, which are primarily rule-based or keyword-driven, struggle with adaptability, semantic understanding, and real-time responsiveness when working with diverse data formats and layouts found in newspaper archives. This research proposes WISE (Web-Intelligent Semantic Extractor), an intelligent, deep learning-based framework designed to overcome these challenges. By integrating Natural Language Processing (NLP) and neural networks, WISE can extract contextually relevant information from dynamic newspaper databases, improving both accuracy and efficiency in data retrieval. The system dynamically adjusts crawling strategies based on content semantics, learning patterns from diverse data sources to enhance relevance and reduce noise. WISE outperforms conventional rule-based, keyword-driven, and non-semantic crawlers by 35% in terms of extraction accuracy and 40% in terms of processing efficiency, according to experimental evaluations conducted on benchmark datasets and actual online environments. WISE showed exceptional scalability, contextual accuracy, semantic understanding, and real-time flexibility in a variety of online scenarios using the News Articles Classification Dataset (Kaggle) and real-time newspaper sources. The framework demonstrates superior performance in extracting structured data from heterogeneous sources while maintaining scalability and security. This work presents a novel, intelligent solution designed to meet the challenges of modern web environments. |
| ISSN: | 2045-2322 |
| DOI: | 10.1038/s41598-025-25616-x |
| Source: | Science Database |