ExecRepoBench: Multi-level Executable Code Completion Evaluation

Bibliographic Details
Container / Database: arXiv.org (Dec 16, 2024), p. n/a
Main Author: Yang, Jian
Other Authors: Zhang, Jiajun; Yang, Jiaxi; Jin, Ke; Zhang, Lei; Peng, Qiyao; Deng, Ken; Miao, Yibo; Liu, Tianyu; Cui, Zeyu; Hui, Binyuan; Lin, Junyang
Published in: Cornell University Library, arXiv.org
Subjects: Repositories; Python; Source code; Software development; Large language models; Coders; Open source software; Programming languages; Benchmarks; Coding
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3145904181
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3145904181 
045 0 |b d20241216 
100 1 |a Yang, Jian 
245 1 |a ExecRepoBench: Multi-level Executable Code Completion Evaluation 
260 |b Cornell University Library, arXiv.org  |c Dec 16, 2024 
513 |a Working Paper 
520 3 |a Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark, ExecRepoBench, and an instruction corpus, Repo-Instruct, aimed at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. In addition, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g. statements, expressions, and functions); an illustrative sketch of this masking idea follows the MARC record below. We then fine-tune an open-source 7B-parameter LLM on Repo-Instruct to produce a strong code completion baseline model, Qwen2.5-Coder-Instruct-C. Qwen2.5-Coder-Instruct-C is rigorously evaluated against existing benchmarks, including MultiPL-E and ExecRepoBench, where it consistently outperforms prior baselines across all programming languages. The deployed model can be used as a high-performance, local service for programming development (https://execrepobench.github.io/). 
653 |a Repositories 
653 |a Python 
653 |a Source code 
653 |a Software development 
653 |a Large language models 
653 |a Coders 
653 |a Open source software 
653 |a Programming languages 
653 |a Benchmarks 
653 |a Coding 
700 1 |a Zhang, Jiajun 
700 1 |a Yang, Jiaxi 
700 1 |a Jin, Ke 
700 1 |a Zhang, Lei 
700 1 |a Peng, Qiyao 
700 1 |a Deng, Ken 
700 1 |a Miao, Yibo 
700 1 |a Liu, Tianyu 
700 1 |a Cui, Zeyu 
700 1 |a Hui, Binyuan 
700 1 |a Lin, Junyang 
773 0 |t arXiv.org  |g (Dec 16, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3145904181/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.11990
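
Note on the masking methodology: the 520 abstract describes masking code fragments at multiple grammatical levels conditioned on the abstract syntax tree. The Python sketch below is only an illustration of that idea, not the authors' pipeline; the level-to-node mapping, the <MASK> placeholder, and the single-file setting are assumptions made for brevity (ExecRepoBench itself draws masked spans from whole repositories).

# A minimal sketch (assumed, not the authors' implementation) of
# multi-level grammar-based masking over a Python abstract syntax tree.
import ast
import random

SOURCE = '''\
def add(a, b):
    total = a + b
    return total
'''

# Hypothetical mapping from a "logical unit" level to AST node types.
LEVELS = {
    "function": (ast.FunctionDef,),
    "statement": (ast.Assign, ast.Return),
    "expression": (ast.BinOp, ast.Call),
}

def mask_random_node(source, level, placeholder="<MASK>"):
    """Mask one node at the chosen granularity; return the completion
    prompt and the ground-truth fragment that was removed."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if isinstance(n, LEVELS[level])]
    node = random.choice(nodes)
    fragment = ast.get_source_segment(source, node)  # exact source span
    return source.replace(fragment, placeholder, 1), fragment

prompt, ground_truth = mask_random_node(SOURCE, "statement")
print(prompt)        # code with one masked statement, shown to the model
print(ground_truth)  # reference fragment for execution-based checking

Masking at the "function" level would remove the entire def add(...) block, while "expression" would mask a sub-span such as a + b; an executable benchmark can then substitute a model's completion for <MASK> and run the repository's tests to score it.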