A research case study: Difficulties and recommendations when using a textual data mining tool

Bibliographic Details
Published in: Information & Management vol. 50, no. 7 (Nov 2013), p. 540
Main Author: Al-Hassan, Abeer A
Other Authors: Alshameri, Faleh; Sibley, Edgar H
Published: Elsevier Sequoia S.A.
Description
Abstract: Although many interesting results have been reported by researchers using numeric data mining methods, there are still questions that need answering before textual data mining tools will be considered generally useful due to the effort needed to learn and use them. In 2011, we generated a dataset from the legal statements (mainly privacy policy and terms of use) on the websites of 475 of the US Fortune 500 Companies and used it as input to see what we could detect about the organizational relationships between the companies by using a textual data mining tool. We hoped to find that the tool would cluster similar corporations into the same industrial sector, as validated by the company's self-reported North American Industry Classification System code (NAICS). Unfortunately, this proved only marginally successful, leading us to ask why and to pose our research question: What problems occur when a data-mining tool is used to analyze large textual datasets that are unstructured, complex, duplicative, and contain many homonyms and synonyms? In analyzing our large dataset we learned a great deal about the problem and fortunately, after significant effort, determined how to "massage" the raw dataset to improve the process and learn how the tool can be better used in research situations. We also found that NAICS, as self-reported by companies, are of dubious value to a researcher -- a matter briefly discussed. [PUBLICATION ABSTRACT]
ISSN: 0378-7206, 1872-7530
Source: ABI/INFORM Global