AI Benchmarks and Datasets for LLM Evaluation
| Publication: | arXiv.org (Dec 2, 2024), p. n/a |
|---|---|
| Main author: | Ivanov, Todor |
| Other authors: | Penchev, Valeri |
| Published: | Cornell University Library, arXiv.org |
| Subjects: | Bulgaria; Data collection; Distributed processing; Benchmarks |
| Links: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Tag | Ind1 | Ind2 | Data |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3138994632 |
| 003 | | | UK-CbPIL |
| 022 | | | $a 2331-8422 |
| 035 | | | $a 3138994632 |
| 045 | 0 | | $b d20241202 |
| 100 | 1 | | $a Ivanov, Todor |
| 245 | 1 | | $a AI Benchmarks and Datasets for LLM Evaluation |
| 260 | | | $b Cornell University Library, arXiv.org $c Dec 2, 2024 |
| 513 | | | $a Working Paper |
| 520 | 3 | | $a LLMs demand significant computational resources for both pre-training and fine-tuning, requiring distributed computing capabilities due to their large model sizes \cite{sastry2024computing}. Their complex architecture poses challenges throughout the entire AI lifecycle, from data collection to deployment and monitoring \cite{OECD_AIlifecycle}. Addressing critical AI system challenges, such as explainability, corrigibility, interpretability, and hallucination, necessitates a systematic methodology and rigorous benchmarking \cite{guldimann2024complai}. To effectively improve AI systems, we must precisely identify systemic vulnerabilities through quantitative evaluation, bolstering system trustworthiness. The enactment of the EU AI Act \cite{EUAIAct} by the European Parliament on March 13, 2024, establishing the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems, further underscores the importance of tools and methodologies such as Z-Inspection. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. To this end, we have launched a project that is part of the AI Safety Bulgaria initiatives \cite{AI_Safety_Bulgaria}, aimed at collecting and categorizing AI benchmarks. This will enable practitioners to identify and utilize these benchmarks throughout the AI system lifecycle. |
| 651 | | 4 | $a Bulgaria |
| 653 | | | $a Data collection |
| 653 | | | $a Distributed processing |
| 653 | | | $a Benchmarks |
| 700 | 1 | | $a Penchev, Valeri |
| 773 | 0 | | $t arXiv.org $g (Dec 2, 2024), p. n/a |
| 786 | 0 | | $d ProQuest $t Engineering Database |
| 856 | 4 | 1 | $3 Citation/Abstract $u https://www.proquest.com/docview/3138994632/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch |
| 856 | 4 | 0 | $3 Full text outside of ProQuest $u http://arxiv.org/abs/2412.01020 |
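Each MARC row above follows the same shape: a three-character tag, optional one-character indicators, and a sequence of subfields, each introduced by a one-letter code (shown here with the conventional `$` delimiter; some displays use `|`). The sketch below parses such display lines with the standard library only; `parse_marc_line` and its output shape are illustrative assumptions, not a real API — for production work a dedicated library such as pymarc would be the usual choice.

```python
# Minimal sketch (stdlib only): parse textual MARC display rows like those
# in the record above into plain dicts. The function name and dict layout
# are hypothetical, for illustration of the tag/indicator/subfield structure.

def parse_marc_line(line: str) -> dict:
    """Split a display line like '100 1 $a Ivanov, Todor' into its parts."""
    tag, rest = line.split(" ", 1)
    # Text before the first subfield delimiter holds the indicators (may be blank).
    head, _, body = rest.partition("$")
    indicators = head.strip()
    subfields = {}
    # Each chunk begins with its one-letter subfield code.
    for chunk in ("$" + body).split("$"):
        chunk = chunk.strip()
        if chunk:
            subfields[chunk[0]] = chunk[1:].strip()
    return {"tag": tag, "indicators": indicators, "subfields": subfields}

# Fields taken from the record above: 100 is the main author, 700 an added author.
record = [
    parse_marc_line("100 1 $a Ivanov, Todor"),
    parse_marc_line("245 1 $a AI Benchmarks and Datasets for LLM Evaluation"),
    parse_marc_line("700 1 $a Penchev, Valeri"),
]
authors = [f["subfields"]["a"] for f in record if f["tag"] in ("100", "700")]
```

A repeated-field-aware parser would collect subfields into a list rather than a dict, since codes like `$a` may repeat within one field; the dict form is kept here for brevity.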