ExecRepoBench: Multi-level Executable Code Completion Evaluation

Bibliographic Details
Container / Database: arXiv.org (Dec 16, 2024), p. n/a
Main Author: Yang, Jian
Other Authors: Zhang, Jiajun; Yang, Jiaxi; Jin, Ke; Zhang, Lei; Peng, Qiyao; Deng, Ken; Miao, Yibo; Liu, Tianyu; Cui, Zeyu; Hui, Binyuan; Lin, Junyang
Published in: Cornell University Library, arXiv.org
Subjects: Repositories; Python; Source code; Software development; Large language models; Coders; Open source software; Programming languages; Benchmarks; Coding
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3145904181
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3145904181 
045 0 |b d20241216 
100 1 |a Yang, Jian 
245 1 |a ExecRepoBench: Multi-level Executable Code Completion Evaluation 
260 |b Cornell University Library, arXiv.org  |c Dec 16, 2024 
513 |a Working Paper 
520 3 |a Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark, ExecRepoBench, and an instruction corpus, Repo-Instruct, aimed at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. In addition, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g. statements, expressions, and functions); an illustrative sketch of this masking idea follows the MARC record below. We then fine-tune an open-source 7B-parameter LLM on Repo-Instruct to produce a strong code completion baseline model, Qwen2.5-Coder-Instruct-C. Qwen2.5-Coder-Instruct-C is rigorously evaluated against existing benchmarks, including MultiPL-E and ExecRepoBench, where it consistently outperforms prior baselines across all programming languages. The deployed model can be used as a high-performance, local service for programming development (https://execrepobench.github.io/). 
653 |a Repositories 
653 |a Python 
653 |a Source code 
653 |a Software development 
653 |a Large language models 
653 |a Coders 
653 |a Open source software 
653 |a Programming languages 
653 |a Benchmarks 
653 |a Coding 
700 1 |a Zhang, Jiajun 
700 1 |a Yang, Jiaxi 
700 1 |a Jin, Ke 
700 1 |a Zhang, Lei 
700 1 |a Peng, Qiyao 
700 1 |a Deng, Ken 
700 1 |a Miao, Yibo 
700 1 |a Liu, Tianyu 
700 1 |a Cui, Zeyu 
700 1 |a Hui, Binyuan 
700 1 |a Lin, Junyang 
773 0 |t arXiv.org  |g (Dec 16, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3145904181/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.11990
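
Note on the masking methodology: the 520 abstract describes masking code fragments at multiple grammatical levels conditioned on the abstract syntax tree. The Python sketch below is only an illustration of that idea, not the authors' pipeline; the level-to-node mapping, the <MASK> placeholder, and the single-file setting are assumptions made for brevity (ExecRepoBench itself draws masked spans from whole repositories).

# A minimal sketch (assumed, not the authors' implementation) of
# multi-level grammar-based masking over a Python abstract syntax tree.
import ast
import random

SOURCE = '''\
def add(a, b):
    total = a + b
    return total
'''

# Hypothetical mapping from a "logical unit" level to AST node types.
LEVELS = {
    "function": (ast.FunctionDef,),
    "statement": (ast.Assign, ast.Return),
    "expression": (ast.BinOp, ast.Call),
}

def mask_random_node(source, level, placeholder="<MASK>"):
    """Mask one node at the chosen granularity; return the completion
    prompt and the ground-truth fragment that was removed."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if isinstance(n, LEVELS[level])]
    node = random.choice(nodes)
    fragment = ast.get_source_segment(source, node)  # exact source span
    return source.replace(fragment, placeholder, 1), fragment

prompt, ground_truth = mask_random_node(SOURCE, "statement")
print(prompt)        # code with one masked statement, shown to the model
print(ground_truth)  # reference fragment for execution-based checking

Masking at the "function" level would remove the entire def add(...) block, while "expression" would mask a sub-span such as a + b; an executable benchmark can then substitute a model's completion for <MASK> and run the repository's tests to score it.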