Effective language identification of forum texts based on statistical approaches
| Published in: | Information Processing & Management vol. 52, no. 4 (Jul 2016), p. 491 |
|---|---|
| Main Author: | Abainia, Kheireddine |
| Other Authors: | Ouamour, Siham; Sayoud, Halim |
| Published: | Elsevier Science Ltd. |
| Subjects: | |
| Online Access: | Citation/Abstract |
MARC
| LEADER | 00000nab a2200000uu 4500 | ||
|---|---|---|---|
| 001 | 1792389416 | ||
| 003 | UK-CbPIL | ||
| 022 | |a 0306-4573 | ||
| 022 | |a 1873-5371 | ||
| 022 | |a 0020-0271 | ||
| 024 | 7 | |a 10.1016/j.ipm.2015.12.003 |2 doi | |
| 035 | |a 1792389416 | ||
| 045 | 2 | |b d20160701 |b d20160731 | |
| 084 | |a 8483 |2 nlm | ||
| 100 | 1 | |a Abainia, Kheireddine | |
| 245 | 1 | |a Effective language identification of forum texts based on statistical approaches | |
| 260 | |b Elsevier Science Ltd. |c Jul 2016 | ||
| 513 | |a Journal Article | ||
| 520 | 3 | |a This investigation deals with the problem of language identification of noisy texts, which could represent the primary step of many natural language processing or information retrieval tasks. Language identification is the task of automatically identifying the language of a given text. Although several methods exist in the literature, their performance is not convincing in practice. In this contribution, we propose two statistical approaches: the high frequency approach and the nearest prototype approach. In the first one, five language identification algorithms are proposed and implemented, namely: character based identification (CBA), word based identification (WBA), special characters based identification (SCA), sequential hybrid algorithm (HA1) and parallel hybrid algorithm (HA2). In the second one, we use 11 similarity measures combined with several types of character N-Grams. For the evaluation task, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, an experimental comparison is made between the proposed approaches and some reference language identification tools such as LIGA, NTC, Google Translate and Microsoft Word. Results show that the proposed approaches outperform the baseline methods of language identification on forum texts. | |
| 653 | |a Studies | ||
| 653 | |a Information retrieval | ||
| 653 | |a Information processing | ||
| 653 | |a Programming languages | ||
| 653 | |a Statistical methods | ||
| 653 | |a Algorithms | ||
| 653 | |a Natural language processing | ||
| 653 | |a Identification methods | ||
| 653 | |a Texts | ||
| 653 | |a Identification | ||
| 653 | |a Language | ||
| 653 | |a Data mining | ||
| 653 | |a Prototypes | ||
| 653 | |a Personality | ||
| 653 | |a Retrieval | ||
| 653 | |a Language disorders | ||
| 653 | |a Language identification | ||
| 653 | |a Languages | ||
| 653 | |a N-Gram language models | ||
| 700 | 1 | |a Ouamour, Siham | |
| 700 | 1 | |a Sayoud, Halim | |
| 773 | 0 | |t Information Processing & Management |g vol. 52, no. 4 (Jul 2016), p. 491 | |
| 786 | 0 | |d ProQuest |t ABI/INFORM Global | |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/1792389416/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch |
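
The abstract in field 520 names a nearest prototype approach that combines similarity measures with character N-Grams, but the record gives no further detail. The following is only a minimal sketch of that general idea, not the authors' method: it assumes character trigrams and cosine similarity (the record does not say which 11 measures the paper uses), and the function names and the two-language toy training data are invented for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(samples, n=3):
    """Merge the n-gram counts of each language's texts into one prototype."""
    prototypes = {}
    for lang, texts in samples.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text, n))
        prototypes[lang] = profile
    return prototypes


def identify(text, prototypes, n=3):
    """Return the language whose prototype is most similar to the text."""
    query = char_ngrams(text, n)
    return max(prototypes, key=lambda lang: cosine(query, prototypes[lang]))


# Toy training data, invented for illustration only (assumption).
SAMPLES = {
    "en": ["the weather is nice today and the people are happy",
           "this is the best forum for questions about the language"],
    "fr": ["le temps est beau aujourd'hui et les gens sont heureux",
           "ceci est le meilleur forum pour les questions sur la langue"],
}

PROTOTYPES = build_prototypes(SAMPLES)
```

Calling `identify("the people on this forum are nice", PROTOTYPES)` compares the query's trigram counts against each language prototype and returns the closest one. Swapping `cosine` for another similarity function, or changing `n`, is the kind of variation the abstract describes.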