From Dataset to Model: A Romanian–English Corpus and Fine-Tuned Cross-Lingual Embeddings for Text and Tabular Retrieval
Gardado en:
| Publicado en: | Applied Sciences vol. 15, no. 22 (2025), p. 12219-12233 |
|---|---|
| Autor Principal: | |
| Outros autores: | |
| Publicado: |
MDPI AG
|
| Materias: | |
| Acceso en liña: | Citation/Abstract Full Text + Graphics Full Text - PDF |
| Etiquetas: |
Sen Etiquetas, Sexa o primeiro en etiquetar este rexistro!
|
| Resumo: | This study introduces a Romanian–English bilingual corpus and a fine-tuned cross-lingual embedding framework aimed at improving retrieval performance in Retrieval-Augmented Generation (RAG) systems. The dataset integrates over 130,000 unstructured question–answer pairs derived from SQuAD and 9750 Romanian-generated questions linked to governmental tabular data, subsequently translated bidirectionally to build parallel Romanian–English resources. Multiple state-of-the-art embedding models, including multilingual-e5, Jina-v3, and the Qwen3-Embeddings family, were systematically evaluated on both text and tabular inputs across four language directions (eng-eng, ro-ro, eng-ro, ro-eng). The results show that while multilingual-e5-large achieved the strongest monolingual retrieval performance, Qwen3-Embedding-4B provided the best overall balance across languages and modalities. Fine-tuning this model using Low-Rank Adaptation (LoRA) and InfoNCE loss improved its Mean Reciprocal Rank (MRR) from 0.4496 to 0.4872 (+8.36%), with the largest gains observed in cross-lingual retrieval tasks. The research highlights persistent challenges in structured (tabular) data retrieval due to dataset imbalance and outlines future directions including dataset expansion, translation refinement, and instruction-based fine-tuning. Overall, this work contributes new bilingual analyses and methodological insights for advancing embedding-based retrieval in low-resource and multimodal contexts. |
|---|---|
| ISSN: | 2076-3417 |
| DOI: | 10.3390/app152212219 |
| Fonte: | Publicly Available Content Database |