AI Benchmarks and Datasets for LLM Evaluation

Bibliographic details
Published: arXiv.org (Dec 2, 2024)
Lead author: Ivanov, Todor
Other authors: Penchev, Valeri
Publisher: Cornell University Library, arXiv.org
Details
Abstract: LLMs demand significant computational resources for both pre-training and fine-tuning, requiring distributed computing capabilities due to their large model sizes \cite{sastry2024computing}. Their complex architecture poses challenges throughout the entire AI lifecycle, from data collection to deployment and monitoring \cite{OECD_AIlifecycle}. Addressing critical AI system challenges, such as explainability, corrigibility, interpretability, and hallucination, necessitates a systematic methodology and rigorous benchmarking \cite{guldimann2024complai}. To effectively improve AI systems, we must precisely identify systemic vulnerabilities through quantitative evaluation, bolstering system trustworthiness. The enactment of the EU AI Act \cite{EUAIAct} by the European Parliament on March 13, 2024, which establishes the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems, further underscores the importance of tools and methodologies such as Z-Inspection. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. To this end, we have launched a project that is part of the AI Safety Bulgaria initiatives \cite{AI_Safety_Bulgaria}, aimed at collecting and categorizing AI benchmarks. This will enable practitioners to identify and utilize these benchmarks throughout the AI system lifecycle.
ISSN: 2331-8422
Database: Engineering Database
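
The benchmark catalog described in the abstract implies a simple data model: each entry can be tagged with the AI lifecycle stages it covers and the system properties it measures, so practitioners can filter benchmarks by stage. Below is a minimal Python sketch of one such representation; it is an assumption for illustration, not the project's actual schema, and the TruthfulQA entry is purely an example.

from dataclasses import dataclass, field
from enum import Enum

class LifecycleStage(Enum):
    # Stages loosely following the OECD AI lifecycle cited in the abstract.
    DATA_COLLECTION = "data collection"
    TRAINING = "training"
    EVALUATION = "evaluation"
    DEPLOYMENT = "deployment"
    MONITORING = "monitoring"

@dataclass
class Benchmark:
    # One catalog entry: where to find the benchmark and what it evaluates.
    name: str
    url: str
    stages: set                                    # LifecycleStage values it applies to
    properties: set = field(default_factory=set)   # e.g. {"hallucination"}

def benchmarks_for(catalog, stage, prop=None):
    """Return catalog entries applicable to a lifecycle stage, optionally
    restricted to entries that measure a given property."""
    return [b for b in catalog
            if stage in b.stages and (prop is None or prop in b.properties)]

# Illustrative usage: find hallucination benchmarks usable during monitoring.
catalog = [
    Benchmark("TruthfulQA", "https://github.com/sylinrl/TruthfulQA",
              {LifecycleStage.EVALUATION, LifecycleStage.MONITORING},
              {"hallucination"}),
]
print([b.name for b in benchmarks_for(catalog, LifecycleStage.MONITORING, "hallucination")])

Keying entries by lifecycle stage mirrors the abstract's stated goal of letting practitioners locate benchmarks at any point from data collection through deployment and monitoring.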