From Dataset to Model: A Romanian–English Corpus and Fine-Tuned Cross-Lingual Embeddings for Text and Tabular Retrieval

Bibliographic Details
Published in: Applied Sciences vol. 15, no. 22 (2025), p. 12219-12233
Main Author: Guțu Bogdan Mihai
Other Authors: Popescu Nirvana
Published: MDPI AG
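Subjects: Language; Open data; Datasets; Hypotheses; International relations; Data collection; Machine translation; Natural language processing; Multilingualism; Linguistics; Bilingualism; Large language models; Chatbots; Retrieval performance measures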
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3275502206
003 UK-CbPIL
022 |a 2076-3417 
024 7 |a 10.3390/app152212219  |2 doi 
035 |a 3275502206 
045 2 |b d20250101  |b d20251231 
084 |a 231338  |2 nlm 
100 1 |a Guțu Bogdan Mihai 
245 1 |a From Dataset to Model: A Romanian–English Corpus and Fine-Tuned Cross-Lingual Embeddings for Text and Tabular Retrieval 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a This study introduces a Romanian–English bilingual corpus and a fine-tuned cross-lingual embedding framework aimed at improving retrieval performance in Retrieval-Augmented Generation (RAG) systems. The dataset integrates over 130,000 unstructured question–answer pairs derived from SQuAD and 9750 Romanian-generated questions linked to governmental tabular data, subsequently translated bidirectionally to build parallel Romanian–English resources. Multiple state-of-the-art embedding models, including multilingual-e5, Jina-v3, and the Qwen3-Embeddings family, were systematically evaluated on both text and tabular inputs across four language directions (eng-eng, ro-ro, eng-ro, ro-eng). The results show that while multilingual-e5-large achieved the strongest monolingual retrieval performance, Qwen3-Embedding-4B provided the best overall balance across languages and modalities. Fine-tuning this model using Low-Rank Adaptation (LoRA) and InfoNCE loss improved its Mean Reciprocal Rank (MRR) from 0.4496 to 0.4872 (+8.36%), with the largest gains observed in cross-lingual retrieval tasks. The research highlights persistent challenges in structured (tabular) data retrieval due to dataset imbalance and outlines future directions including dataset expansion, translation refinement, and instruction-based fine-tuning. Overall, this work contributes new bilingual analyses and methodological insights for advancing embedding-based retrieval in low-resource and multimodal contexts. 
653 |a Language 
653 |a Open data 
653 |a Datasets 
653 |a Hypotheses 
653 |a International relations 
653 |a Data collection 
653 |a Machine translation 
653 |a Natural language processing 
653 |a Multilingualism 
653 |a Linguistics 
653 |a Bilingualism 
653 |a Large language models 
653 |a Chatbots 
653 |a Retrieval performance measures 
700 1 |a Popescu Nirvana 
773 0 |t Applied Sciences  |g vol. 15, no. 22 (2025), p. 12219-12233 
786 0 |d ProQuest  |t Publicly Available Content Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3275502206/abstract/embedded/Q8Z64E4HU3OH5N8U?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3275502206/fulltextwithgraphics/embedded/Q8Z64E4HU3OH5N8U?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3275502206/fulltextPDF/embedded/Q8Z64E4HU3OH5N8U?source=fedsrch
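
Note on the metric in field 520: the abstract reports a Mean Reciprocal Rank (MRR) gain from 0.4496 to 0.4872, stated as +8.36%. The sketch below is a minimal illustration of how MRR is conventionally computed and how that relative gain follows from the two reported values. It is not the authors' evaluation code. The function names, the toy rankings, and the assumption of exactly one relevant document per query are hypothetical and introduced here only for illustration.

```python
def mean_reciprocal_rank(ranked_ids, gold_ids):
    """Average of 1/rank of the relevant document over all queries.

    Assumption (not from the record): each query has exactly one relevant
    document; a query whose relevant document is not retrieved scores 0.
    """
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for position, doc_id in enumerate(ranking, start=1):
            if doc_id == gold:
                total += 1.0 / position
                break
    return total / len(gold_ids)


def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Hypothetical toy data: three queries, three retrieved ids each.
toy_rankings = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d7", "d8", "d9"]]
toy_gold = ["d1", "d4", "d0"]  # hit at rank 1, hit at rank 2, miss

toy_mrr = mean_reciprocal_rank(toy_rankings, toy_gold)  # (1 + 1/2 + 0) / 3

# The two MRR values below are the ones reported in the abstract.
reported_gain = relative_gain_percent(0.4496, 0.4872)
print(f"toy MRR = {toy_mrr:.4f}, reported relative gain = {reported_gain:.2f}%")
```

The relative gain is (0.4872 - 0.4496) / 0.4496, which rounds to the 8.36% given in the abstract.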