Effective language identification of forum texts based on statistical approaches

Bibliographic Details
Published in: Information Processing & Management vol. 52, no. 4 (Jul 2016), p. 491
Main Author: Abainia, Kheireddine
Other Authors: Ouamour, Siham, Sayoud, Halim
Published:
Elsevier Science Ltd.
Subjects:
Online Access: Citation/Abstract

MARC

LEADER 00000nab a2200000uu 4500
001 1792389416
003 UK-CbPIL
022 |a 0306-4573 
022 |a 1873-5371 
022 |a 0020-0271 
024 7 |a 10.1016/j.ipm.2015.12.003  |2 doi 
035 |a 1792389416 
045 2 |b d20160701  |b d20160731 
084 |a 8483  |2 nlm 
100 1 |a Abainia, Kheireddine 
245 1 |a Effective language identification of forum texts based on statistical approaches 
260 |b Elsevier Science Ltd.  |c Jul 2016 
513 |a Journal Article 
520 3 |a This investigation deals with the problem of language identification of noisy texts, which could represent the primary step of many natural language processing or information retrieval tasks. Language identification is the task of automatically identifying the language of a given text. Although several methods exist in the literature, their performance is not convincing in practice. In this contribution, we propose two statistical approaches: the high frequency approach and the nearest prototype approach. In the first one, 5 algorithms of language identification are proposed and implemented, namely: character based identification (CBA), word based identification (WBA), special characters based identification (SCA), sequential hybrid algorithm (HA1) and parallel hybrid algorithm (HA2). In the second one, we use 11 similarity measures combined with several types of character N-Grams. For the evaluation task, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, an experimental comparison is made between the proposed approaches and some reference language identification tools such as LIGA, NTC, Google Translate and Microsoft Word. Results show that the proposed approaches are interesting and outperform the baseline methods of language identification on forum texts. 
653 |a Studies 
653 |a Information retrieval 
653 |a Information processing 
653 |a Programming languages 
653 |a Statistical methods 
653 |a Algorithms 
653 |a Natural language processing 
653 |a Identification methods 
653 |a Texts 
653 |a Identification 
653 |a Language 
653 |a Data mining 
653 |a Prototypes 
653 |a Personality 
653 |a Retrieval 
653 |a Language disorders 
653 |a Language identification 
653 |a Languages 
653 |a N-Gram language models 
700 1 |a Ouamour, Siham 
700 1 |a Sayoud, Halim 
773 0 |t Information Processing & Management  |g vol. 52, no. 4 (Jul 2016), p. 491 
786 0 |d ProQuest  |t ABI/INFORM Global 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/1792389416/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch