From Dataset to Model: A Romanian–English Corpus and Fine-Tuned Cross-Lingual Embeddings for Text and Tabular Retrieval

Bibliographic Details
Published in: Applied Sciences, vol. 15, no. 22 (2025), pp. 12219–12233
Main author: Guțu, Bogdan Mihai
Other authors: Popescu, Nirvana
Published by: MDPI AG
Description
Abstract: This study introduces a Romanian–English bilingual corpus and a fine-tuned cross-lingual embedding framework aimed at improving retrieval performance in Retrieval-Augmented Generation (RAG) systems. The dataset integrates over 130,000 unstructured question–answer pairs derived from SQuAD and 9750 Romanian-generated questions linked to governmental tabular data, subsequently translated bidirectionally to build parallel Romanian–English resources. Multiple state-of-the-art embedding models, including multilingual-e5, Jina-v3, and the Qwen3-Embeddings family, were systematically evaluated on both text and tabular inputs across four language directions (eng-eng, ro-ro, eng-ro, ro-eng). The results show that while multilingual-e5-large achieved the strongest monolingual retrieval performance, Qwen3-Embedding-4B provided the best overall balance across languages and modalities. Fine-tuning this model using Low-Rank Adaptation (LoRA) and InfoNCE loss improved its Mean Reciprocal Rank (MRR) from 0.4496 to 0.4872 (+8.36%), with the largest gains observed in cross-lingual retrieval tasks. The research highlights persistent challenges in structured (tabular) data retrieval due to dataset imbalance and outlines future directions including dataset expansion, translation refinement, and instruction-based fine-tuning. Overall, this work contributes new bilingual analyses and methodological insights for advancing embedding-based retrieval in low-resource and multimodal contexts.
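The abstract reports retrieval quality as Mean Reciprocal Rank (MRR). As a minimal sketch of the standard metric (not the authors' evaluation code; the function name and argument layout are illustrative assumptions), MRR averages the reciprocal rank of the first relevant result over all queries:

```python
def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Average of 1/rank of the first relevant result per query.

    ranked_ids: one ranked list of result IDs per query.
    relevant_ids: the single relevant ID per query.
    """
    total = 0.0
    for results, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids)

# Two queries: relevant doc at rank 1 and rank 2 -> (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], ["a", "y"]))
```

On this scale, the reported improvement from 0.4496 to 0.4872 means the first relevant document moves, on average, closer to the top of the ranked list.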
ISSN:2076-3417
DOI:10.3390/app152212219
Source: Publicly Available Content Database