Towards better Hebrew clickbait detection: Insights from BERT and data augmentation

Bibliographic Details
Published in:	PLoS One vol. 20, no. 11 (Nov 2025), p. e0332342
Main Author:	Natanya, Talya
Other Authors:	Liebeskind, Chaya
Published:	Public Library of Science
Subjects:	Language; Accuracy; Machine learning; Text categorization; Data augmentation; Benchmarks; Deep learning; Datasets; Configuration management; Classification; Multilingualism; Natural language processing; Large language models; Learning algorithms; Digital media; Semantics; False information; Readers; Economic

Description
Abstract:	Clickbait headlines, designed to entice readers with sensationalized or misleading content, pose significant challenges in the digital landscape. They exploit curiosity to generate traffic and revenue, often at the cost of spreading misinformation and undermining the credibility of online content. Identifying clickbait is essential for improving the quality of information consumed, fostering trust in digital media, and enabling users to make informed decisions. This study advances Hebrew clickbait detection through deep learning approaches and comprehensive data augmentation strategies, targeting the unique challenges of processing a low-resource language. Building on prior research that achieved an accuracy of 87% using traditional machine learning methods, this work explores the potential of BERT-based models and diverse augmentation techniques to further enhance performance. Our experiments incorporated a variety of augmentation methods, including weak supervision, substitution-based methods, generative techniques, and language-based methods, applied to state-of-the-art Hebrew language models. The results show that targeted augmentation strategies, particularly those focusing on word-level replacements and contextual enhancements, consistently improved model performance. Our top-performing configuration achieved an accuracy of 92%, surpassing traditional machine learning benchmarks. The results of this study can be applied in real-world systems to automatically detect and reduce clickbait in Hebrew digital media, supporting news websites and social platforms in improving content quality and user trust. Furthermore, the study provides a replicable framework for tackling similar challenges in other underrepresented languages, highlighting the potential of combining advanced deep learning methods with tailored data augmentation strategies.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0332342
Source:	Health & Medical Collection