Cross-Lingual Language Modeling: Methods and Applications

Bibliographic Details
Published in: PQDT - Global (2021)
Main Author: Lee, Grandee
Published: ProQuest Dissertations & Theses
Online Access: Citation/Abstract; Full Text - PDF
Description
Abstract: To connect the numerous languages around the world and enable systems to bring these languages together, we need cross-lingual learning that addresses the key challenges in this endeavour. Cross-lingual learning generally brings monolingual systems together into a multilingual space through which a system can realize cross-lingual transfer in tasks like zero-shot POS tagging, retrieval, or classification. It also enables applications such as the modeling of code-switching languages. These applications present different challenges: the data is often small, and in the code-switching domain it is sparse. A more systematic challenge is the performance degradation related to language distance and to differences in language structure and domain. This thesis therefore investigates the principles underlying cross-lingual learning, and how novel methods that adhere to these principles can address the above-mentioned challenges in a more data-efficient manner.

Firstly, to understand cross-lingual learning, we propose an information-theoretic framework grounded in the established field of decipherment. We systematically explore the factors that contribute to the difficulty of deciphering a language and identify lexical granularity, language order, and data distribution as the main factors. The decipherment perspective also leads us to the conclusion that unsupervised cross-lingual learning has an inherent limit that is significantly affected by the decipherment conditions. Most distant languages are difficult to decipher, which prompts us to use suitable cross-lingual signals to bridge this distributional divergence. We use bilingual dictionaries as a case study and draw important insights; these lessons are instrumental in the subsequent analysis of the methods and data used for cross-lingual learning.

Secondly, based on the proposed guidelines, we develop a linguistically motivated data augmentation method for cross-lingual learning. Natural code-mixed data is essential for code-switching language modeling, yet it is extremely low-resourced because it belongs to the spoken domain. We propose a computational method to generate synthetic code-switching data using the Matrix Language Frame theory. Using this pseudo-code-switching data, we advance the state-of-the-art performance in code-switching language modeling. The method considers not only word-level alignment but also phrase-level correspondence, which can be adapted to each language based on commonly used n-grams. Combined with a switching-fraction mechanism, this controls the complexity of the code-mixed output, which in turn can be tuned for optimal downstream performance.

Thirdly, we propose a novel neural back-off scheme for the language model. The back-off scheme relies on a broader class category when data on finer lexical categories is lacking. The model can effectively tackle the data sparsity issue in code-switching, since monolingual data contains no switching points and code-switching data is low-resourced. The improvement in the perplexity of code-switching language modeling demonstrates that the method better utilizes the given information and makes more reliable inferences even when in-domain training data is lacking. In addition, we propose a novel bilingual attention model architecture that can effectively use parallel corpora.
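The abstract describes the Matrix Language Frame (MLF) based augmentation only at a high level. As an illustration of the general idea, here is a minimal sketch that builds a synthetic code-switched sentence from a matrix-language sentence and its word/phrase alignments, with a switching fraction controlling how heavily mixed the output is; the function name, the data format, and the random choice of switch points are assumptions for illustration, not the procedure used in the thesis.

```python
import random

def generate_code_switched(matrix_tokens, aligned_embedded, switch_fraction=0.3, seed=0):
    """Toy MLF-style augmentation: the matrix language supplies the sentence
    frame, and a controllable fraction of aligned positions is replaced by
    embedded-language words or phrases."""
    rng = random.Random(seed)
    alignable = sorted(aligned_embedded)                      # positions that have an alignment
    n_switch = int(round(switch_fraction * len(alignable)))   # switching fraction -> number of switches
    switch_points = set(rng.sample(alignable, n_switch)) if n_switch else set()

    output = []
    for i, token in enumerate(matrix_tokens):
        if i in switch_points:
            output.extend(aligned_embedded[i])                # insert embedded-language phrase
        else:
            output.append(token)                              # keep matrix-language word
    return output

# Example with a hypothetical English matrix sentence and Mandarin alignments.
print(generate_code_switched(
    ["I", "want", "to", "eat", "noodles"],
    {1: ["要"], 3: ["吃"], 4: ["面"]},
    switch_fraction=0.5,
))
# e.g. ['I', '要', 'to', 'eat', '面']
```

Raising or lowering switch_fraction is one simple way to tune the complexity of the generated code-mixed data, in the spirit of the switching-fraction mechanism described above.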
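The neural back-off scheme is likewise only named in the abstract. The sketch below shows one plausible reading of a class-based back-off output layer, in which a word-level softmax is interpolated with a coarser class-level softmax via a learned gate; the module name, the gating mechanism, and the uniform within-class spreading are assumptions made for this illustration, not the author's exact architecture.

```python
import torch
import torch.nn as nn

class ClassBackoffLM(nn.Module):
    """Sketch of a class-based back-off output layer: a fine-grained word
    distribution is interpolated with a coarser class distribution (spread
    uniformly over the words of each class), with a learned gate deciding
    how much to back off."""

    def __init__(self, hidden_dim, vocab_size, num_classes, word_to_class):
        super().__init__()
        self.word_head = nn.Linear(hidden_dim, vocab_size)    # fine-grained lexical categories
        self.class_head = nn.Linear(hidden_dim, num_classes)  # broader class categories
        self.gate = nn.Linear(hidden_dim, 1)                  # learned back-off weight
        w2c = torch.as_tensor(word_to_class, dtype=torch.long)
        self.register_buffer("word_to_class", w2c)
        self.register_buffer(
            "log_class_size",
            torch.bincount(w2c, minlength=num_classes).float().clamp(min=1).log(),
        )

    def forward(self, hidden):
        # hidden: (batch, hidden_dim) context vector from any encoder (RNN, Transformer, ...)
        word_logp = torch.log_softmax(self.word_head(hidden), dim=-1)
        class_logp = torch.log_softmax(self.class_head(hidden), dim=-1)
        # class probability spread uniformly over the members of each class
        backoff_logp = (class_logp[:, self.word_to_class]
                        - self.log_class_size[self.word_to_class])
        g = torch.sigmoid(self.gate(hidden))                  # (batch, 1), in (0, 1)
        # mixture of the two distributions, computed in log space
        return torch.logaddexp(torch.log(g) + word_logp,
                               torch.log1p(-g) + backoff_logp)
```

Computing the interpolation in log space keeps the mixture numerically stable when either distribution assigns very small probabilities, which is exactly the regime where backing off to the broader class matters.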
ISBN: 9798352685815
Source: Education Database