AI Benchmarks and Datasets for LLM Evaluation

Bibliographic Details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main Author: Ivanov, Todor
Other Authors: Penchev, Valeri
Published:
Cornell University Library, arXiv.org
Subjects: Bulgaria; Data collection; Distributed processing; Benchmarks
Links: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3138994632
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3138994632 
045 0 |b d20241202 
100 1 |a Ivanov, Todor 
245 1 |a AI Benchmarks and Datasets for LLM Evaluation 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a LLMs demand significant computational resources for both pre-training and fine-tuning, requiring distributed computing capabilities due to their large model sizes \cite{sastry2024computing}. Their complex architecture poses challenges throughout the entire AI lifecycle, from data collection to deployment and monitoring \cite{OECD_AIlifecycle}. Addressing critical AI system challenges, such as explainability, corrigibility, interpretability, and hallucination, necessitates a systematic methodology and rigorous benchmarking \cite{guldimann2024complai}. To effectively improve AI systems, we must precisely identify systemic vulnerabilities through quantitative evaluation, bolstering system trustworthiness. The enactment of the EU AI Act \cite{EUAIAct} by the European Parliament on March 13, 2024, establishing the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems, further underscores the importance of tools and methodologies such as Z-Inspection. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. To this end, we have launched a project that is part of the AI Safety Bulgaria initiatives \cite{AI_Safety_Bulgaria}, aimed at collecting and categorizing AI benchmarks. This will enable practitioners to identify and utilize these benchmarks throughout the AI system lifecycle. 
651 4 |a Bulgaria 
653 |a Data collection 
653 |a Distributed processing 
653 |a Benchmarks 
700 1 |a Penchev, Valeri 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3138994632/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.01020