The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World

Bibliographic Details
Published in:	Entropy vol. 27, no. 10 (2025), p. 1039-1051
Main Author:	Ryabko, Boris
Other Authors:	Savina, Nadezhda; Getachew, Lulu Yeshewas; Han, Yunfei
Published:	MDPI AG
Subjects:	Linguistics; Writers; Sino-Tibetan languages; Hypothesis testing; Data compression; Russian language; Information theory; Indo-European languages; Languages; West Germanic languages; Computational linguistics; Methods; Fiction; Semitic languages; Computer science; Compression; Amharic; Information sources; Chinese languages; Statistical analysis; Families & family life; Literary criticism; English language; Germanic languages; Data; Asian cultural groups; Slavic cultural groups

Description
Abstract:	We apply an information-theoretic method proposed by Ryabko and Savina (hence called the RS-method), which is based on data compression, to recognize a writer’s individual style in four languages from different language groups and families. The method was used to study fiction texts in Russian (East Slavic group of the Indo-European language family), Amharic (South Ethiosemitic group of the Semitic language family), Chinese (Sinitic group of the Sino-Tibetan language family), and English (West Germanic group of the Indo-European language family). The amount of data necessary for recognizing an author’s style was found to be almost the same for all four languages, i.e., it is invariant across different language groups. The results are of interest to computer science, literary studies, linguistics and, in particular, computational linguistics.
ISSN:	1099-4300
DOI:	10.3390/e27101039
Source:	Engineering Database