From Dataset to Model: A Romanian–English Corpus and Fine-Tuned Cross-Lingual Embeddings for Text and Tabular Retrieval
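| Published in: | Applied Sciences vol. 15, no. 22 (2025), p. 12219-12233 |
|---|---|
| Main author: | Guțu Bogdan Mihai |
| Other authors: | Popescu Nirvana |
| Published: | MDPI AG |
| Subjects: | Language; Open data; Datasets; Hypotheses; International relations; Data collection; Machine translation; Natural language processing; Multilingualism; Linguistics; Bilingualism; Large language models; Chatbots; Retrieval performance measures |
| Online access: | Citation/Abstract; Full Text + Graphics; Full Text - PDF |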
MARC
| Tag | Ind1 | Ind2 | Content |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3275502206 |
| 003 | | | UK-CbPIL |
| 022 | | | $a 2076-3417 |
| 024 | 7 | | $a 10.3390/app152212219 $2 doi |
| 035 | | | $a 3275502206 |
| 045 | 2 | | $b d20250101 $b d20251231 |
| 084 | | | $a 231338 $2 nlm |
| 100 | 1 | | $a Guțu Bogdan Mihai |
| 245 | 1 | | $a From Dataset to Model: A Romanian–English Corpus and Fine-Tuned Cross-Lingual Embeddings for Text and Tabular Retrieval |
| 260 | | | $b MDPI AG $c 2025 |
| 513 | | | $a Journal Article |
| 520 | 3 | | $a This study introduces a Romanian–English bilingual corpus and a fine-tuned cross-lingual embedding framework aimed at improving retrieval performance in Retrieval-Augmented Generation (RAG) systems. The dataset integrates over 130,000 unstructured question–answer pairs derived from SQuAD and 9750 Romanian-generated questions linked to governmental tabular data, subsequently translated bidirectionally to build parallel Romanian–English resources. Multiple state-of-the-art embedding models, including multilingual-e5, Jina-v3, and the Qwen3-Embeddings family, were systematically evaluated on both text and tabular inputs across four language directions (eng-eng, ro-ro, eng-ro, ro-eng). The results show that while multilingual-e5-large achieved the strongest monolingual retrieval performance, Qwen3-Embedding-4B provided the best overall balance across languages and modalities. Fine-tuning this model using Low-Rank Adaptation (LoRA) and InfoNCE loss improved its Mean Reciprocal Rank (MRR) from 0.4496 to 0.4872 (+8.36%), with the largest gains observed in cross-lingual retrieval tasks. The research highlights persistent challenges in structured (tabular) data retrieval due to dataset imbalance and outlines future directions including dataset expansion, translation refinement, and instruction-based fine-tuning. Overall, this work contributes new bilingual analyses and methodological insights for advancing embedding-based retrieval in low-resource and multimodal contexts. |
| 653 | | | $a Language |
| 653 | | | $a Open data |
| 653 | | | $a Datasets |
| 653 | | | $a Hypotheses |
| 653 | | | $a International relations |
| 653 | | | $a Data collection |
| 653 | | | $a Machine translation |
| 653 | | | $a Natural language processing |
| 653 | | | $a Multilingualism |
| 653 | | | $a Linguistics |
| 653 | | | $a Bilingualism |
| 653 | | | $a Large language models |
| 653 | | | $a Chatbots |
| 653 | | | $a Retrieval performance measures |
| 700 | 1 | | $a Popescu Nirvana |
| 773 | 0 | | $t Applied Sciences $g vol. 15, no. 22 (2025), p. 12219-12233 |
| 786 | 0 | | $d ProQuest $t Publicly Available Content Database |
| 856 | 4 | 1 | $3 Citation/Abstract $u https://www.proquest.com/docview/3275502206/abstract/embedded/Q8Z64E4HU3OH5N8U?source=fedsrch |
| 856 | 4 | 0 | $3 Full Text + Graphics $u https://www.proquest.com/docview/3275502206/fulltextwithgraphics/embedded/Q8Z64E4HU3OH5N8U?source=fedsrch |
| 856 | 4 | 0 | $3 Full Text - PDF $u https://www.proquest.com/docview/3275502206/fulltextPDF/embedded/Q8Z64E4HU3OH5N8U?source=fedsrch |
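
The abstract in field 520 reports Mean Reciprocal Rank (MRR) rising from 0.4496 to 0.4872, a relative gain of +8.36%. The following is a minimal sketch of how MRR and that relative gain can be computed. It is not the authors' evaluation code: the function names and the toy rank list are hypothetical, and the sketch assumes each query has one relevant item whose 1-based rank is known, or `None` when it was not retrieved.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based rank of the first relevant item for query i,
    or None if no relevant item was retrieved (contributes 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def relative_gain_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Hypothetical ranks for four queries (illustration only, not data from the paper).
toy_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(toy_ranks))  # (1 + 1/2 + 0 + 1/4) / 4 = 0.4375

# MRR values quoted in the abstract.
print(round(relative_gain_percent(0.4496, 0.4872), 2))  # 8.36
```

The second printed value matches the +8.36% figure stated in the abstract, which confirms that the reported gain is relative to the baseline MRR rather than an absolute difference.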