AI driven web crawling for semantic extraction of news content from newspapers

Bibliographic Details
Published in: Scientific Reports (Nature Publisher Group) vol. 15, no. 1 (2025), p. 41673-41692
Main Author: S, Saravanan
Other Authors: A K, Ashfauk Ahamed
Published:
Nature Publishing Group
Subjects:
Online Access: Citation/Abstract
Full Text
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3275323990
003 UK-CbPIL
022 |a 2045-2322 
024 7 |a 10.1038/s41598-025-25616-x  |2 doi 
035 |a 3275323990 
045 2 |b d20250101  |b d20251231 
084 |a 274855  |2 nlm 
100 1 |a S, Saravanan  |u Department of Computer Science and Engineering, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565) 
245 1 |a AI driven web crawling for semantic extraction of news content from newspapers 
260 |b Nature Publishing Group  |c 2025 
513 |a Journal Article 
520 3 |a Efficient data extraction from the ever-expanding web, including structured and unstructured sources such as newspaper databases, is critical for industries like media, research, and journalism. Traditional web crawlers, which are primarily rule-based or keyword-driven, struggle with adaptability, semantic understanding, and real-time responsiveness when working with diverse data formats and layouts found in newspaper archives. This research proposes WISE (Web-Intelligent Semantic Extractor), an intelligent, deep learning-based framework designed to overcome these challenges. By integrating Natural Language Processing (NLP) and neural networks, WISE can extract contextually relevant information from dynamic newspaper databases, improving both accuracy and efficiency in data retrieval. The system dynamically adjusts crawling strategies based on content semantics, learning patterns from diverse data sources to enhance relevance and reduce noise. WISE outperforms conventional rule-based, keyword-driven, and non-semantic crawlers by 35% in terms of extraction accuracy and 40% in terms of processing efficiency, according to experimental evaluations conducted on benchmark datasets and actual online environments. WISE showed exceptional scalability, contextual accuracy, semantic understanding, and real-time flexibility in a variety of online scenarios using the News Articles Classification Dataset (Kaggle) and real-time newspaper sources. The framework demonstrates superior performance in extracting structured data from heterogeneous sources while maintaining scalability and security. This work presents a novel, intelligent solution designed to meet the challenges of modern web environments. 
653 |a Language 
653 |a Databases 
653 |a Machine learning 
653 |a Deep learning 
653 |a Noise reduction 
653 |a Artificial intelligence 
653 |a Newspapers 
653 |a Semantics 
653 |a Data mining 
653 |a Neural networks 
653 |a Data processing 
653 |a Archives & records 
653 |a Electronic newspapers 
653 |a Semantic web 
653 |a Natural language processing 
653 |a Information processing 
653 |a Algorithms 
653 |a Social 
700 1 |a A K, Ashfauk Ahamed  |u Department of Computer Applications, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565) 
773 0 |t Scientific Reports (Nature Publisher Group)  |g vol. 15, no. 1 (2025), p. 41673-41692 
786 0 |d ProQuest  |t Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3275323990/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text  |u https://www.proquest.com/docview/3275323990/fulltext/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3275323990/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch