Comparison of algorithms for the recognition of ChatGPT paraphrased texts

Bibliographic details
Published in: Journal of Big Data vol. 12, no. 1 (Feb 2025), p. 28
Published: Springer Nature B.V.
Online access: Citation/Abstract
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3164175026
003 UK-CbPIL
022 |a 2196-1115 
024 7 |a 10.1186/s40537-025-01082-0  |2 doi 
035 |a 3164175026 
045 2 |b d20250201  |b d20250228 
245 1 |a Comparison of algorithms for the recognition of ChatGPT paraphrased texts 
260 |b Springer Nature B.V.  |c Feb 2025 
513 |a Journal Article 
520 3 |a The rapid development of artificial intelligence, especially AI assistants, is leading to new forms of plagiarism that are difficult to detect using existing methods. Paraphrasing tools make this problem even more complex and challenging, especially in minor languages with inadequate resources and tools. This study explores strategies to help detect plagiarism generated by ChatGPT 4.0 and altered by paraphrasing tools. We propose two new datasets consisting of abstracts of doctoral theses in English and Serbian. Both datasets were subjected to ChatGPT paraphrasing, which allowed us to form two classes of texts: human-written and AI-generated, i.e., AI-paraphrased. We then comprehensively compare 19 widely used classification algorithms based on two feature sets: word unigrams and character multigrams. In addition, we compare these to the results of a commercially available pre-trained ChatGPT content detector, ZeroGPT. The results on the English corpus are very accurate, reaching an accuracy of 95% or more. In contrast, the results on the Serbian corpus are less accurate, reaching an accuracy of just over 85%. Syntax analysis of the training datasets shows that in Serbian GPT-paraphrased texts, 33.2% of sentences remain the same, and they are found in 63% of documents. In GPT-paraphrased English texts, 3.2% of sentences remain the same, and they are found in 16% of documents. Syntax analysis of the test datasets shows that changing the model temperature influences syntactic features (average number of words and sentences) in English texts but has little or no effect in Serbian texts. We attribute all these differences to GPT’s lower paraphrasing ability in minor languages such as Serbian. 
The presented findings underscore the need for a persistent effort to develop tools for detecting AI-paraphrased texts in academic and professional settings, particularly for minor languages with limited NLP resources, to preserve content integrity and authenticity. 
653 |a Datasets 
653 |a Syntax 
653 |a Artificial intelligence 
653 |a Chatbots 
653 |a Texts 
653 |a Languages 
653 |a Documents 
653 |a Algorithms 
653 |a Plagiarism 
653 |a Sentences 
653 |a Big Data 
653 |a Serbo-Croatian language 
653 |a Paraphrase 
653 |a English language 
653 |a Classification 
653 |a Corpus linguistics 
653 |a Syntactic analysis 
653 |a Human-computer interaction 
653 |a Natural language processing 
653 |a Morality 
653 |a Syntactic features 
653 |a Accuracy 
773 0 |t Journal of Big Data  |g vol. 12, no. 1 (Feb 2025), p. 28 
786 0 |d ProQuest  |t ABI/INFORM Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3164175026/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3164175026/fulltextPDF/embedded/6A8EOT78XXH2IG52?source=fedsrch
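
Illustrative note on the abstract (field 520): the abstract reports the share of sentences left unchanged by paraphrasing, the share of documents containing such sentences, and two feature sets (word unigrams and character multigrams). The sketch below shows one way such quantities could be computed. It is not the authors' code: the function names, the naive sentence splitter, exact string matching, and the 2 to 4 character n-gram range are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def unchanged_sentence_stats(pairs):
    """pairs: list of (original_text, paraphrased_text) tuples.

    Returns (sentence_share, doc_share):
      sentence_share: fraction of paraphrased sentences that also appear
                      verbatim in the corresponding original text
      doc_share:      fraction of documents with at least one such sentence
    """
    total = unchanged = docs_with_unchanged = 0
    for original, paraphrased in pairs:
        original_set = set(split_sentences(original))
        para_sentences = split_sentences(paraphrased)
        same = sum(1 for s in para_sentences if s in original_set)
        total += len(para_sentences)
        unchanged += same
        if same:
            docs_with_unchanged += 1
    sentence_share = unchanged / total if total else 0.0
    doc_share = docs_with_unchanged / len(pairs) if pairs else 0.0
    return sentence_share, doc_share


def word_unigrams(text):
    """Count lowercased word tokens (word unigram features)."""
    return Counter(re.findall(r"\w+", text.lower()))


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams for every n in [n_min, n_max] (range is assumed)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts
```

Calling `unchanged_sentence_stats` on a list of (original, paraphrased) abstract pairs returns the two shares as fractions between 0 and 1. The counters from `word_unigrams` and `char_ngrams` could be fed to any standard classifier; which of the 19 algorithms and which n-gram lengths the study used is not stated in this record.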