Automatic induction of language model data for a spoken dialogue system

Bibliographic Details
Published in:	Language Resources and Evaluation vol. 40, no. 1 (Feb 2006), p. 25-46
Main Author:	Wang, Chao
Other Authors:	Chung, Grace; Seneff, Stephanie
Published:	Springer Nature B.V.
Subjects:	Massachusetts Institute of Technology; Models; Linguistics; Discourse analysis; Semantics; Data analysis; Syntax semantics relationship; Sampling methods; Corpus analysis; Corpus linguistics; Statistical analysis; Computer simulation; Logic; Training; Sentences; Error analysis; Human-computer interaction; Simulation; Utterances; Probability; Induction; Sampling; Data; Restaurants

Description
Abstract:	In this paper, we address the issue of generating in-domain language model training data when little or no real user data are available. The two-stage approach taken begins with a data induction phase whereby linguistic constructs from out-of-domain sentences are harvested and integrated with artificially constructed in-domain phrases. After some syntactic and semantic filtering, a large corpus of synthetically assembled user utterances is induced. In the second stage, two sampling methods are explored to filter the synthetic corpus to achieve a desired probability distribution of the semantic content, both on the sentence level and on the class level. The first method utilizes user simulation technology, which obtains the probability model via an interplay between a probabilistic user model and the dialogue system. The second method synthesizes novel dialogue interactions from the raw data by modelling after a small set of dialogues produced by the developers during the course of system refinement. Evaluation is conducted on recognition performance in a restaurant information domain. We show that a partial match to usage-appropriate semantic content distribution can be achieved via user simulations. Furthermore, word error rate can be reduced when limited amounts of in-domain training data are augmented with synthetic data derived by our methods.
ISSN:	1574-020X 1574-0218 0010-4817
DOI:	10.1007/s10579-006-9007-3
Source:	Arts & Humanities Database