A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading

Bibliographic Details
Published in: Applied Sciences, vol. 15, no. 18 (2025), p. 10055-10067
Main Author: Bernik Andrija
Other Authors: Radošević Danijel, Čep Andrej
Published: MDPI AG, 2025
Subjects: Higher education; Accuracy; Students; Essays; Automation; Artificial intelligence; Feedback; Peers; Large language models; Learning; Teachers; Chatbots

MARC

LEADER 00000nab a2200000uu 4500
001 3254469051
003 UK-CbPIL
022 |a 2076-3417 
024 7 |a 10.3390/app151810055  |2 doi 
035 |a 3254469051 
045 2 |b d20250101  |b d20251231 
084 |a 231338  |2 nlm 
100 1 |a Bernik Andrija  |u Department of Multimedia, University North, 42000 Varaždin, Croatia 
245 1 |a A Comparative Study of Large Language Models in Programming Education: Accuracy, Efficiency, and Feedback in Student Assignment Grading 
260 |b MDPI AG  |c 2025 
513 |a Journal Article 
520 3 |a Programming education traditionally requires extensive manual assessment of student assignments, which is both time-consuming and resource-intensive for instructors. Recent advances in large language models (LLMs) open opportunities for automating this process and providing timely feedback. This paper investigates the application of artificial intelligence (AI) tools for preliminary assessment of undergraduate programming assignments. A multi-phase experimental study was conducted across three computer science courses: Introduction to Programming, Programming 2, and Advanced Programming Concepts. A total of 315 Python assignments were collected from the Moodle learning management system, with 100 randomly selected submissions analyzed in detail. AI evaluation was performed using ChatGPT-4 (GPT-4-turbo), Claude 3, and Gemini 1.5 Pro models, employing structured prompts aligned with a predefined rubric that assessed functionality, code structure, documentation, and efficiency. Quantitative results demonstrate high correlation between AI-generated scores and instructor evaluations, with ChatGPT-4 achieving the highest consistency (Pearson coefficient 0.91) and the lowest average absolute deviation (0.68 points). Qualitative analysis highlights AI’s ability to provide structured, actionable feedback, though variability across models was observed. The study identifies benefits such as faster evaluation and enhanced feedback quality, alongside challenges including model limitations, potential biases, and the need for human oversight. Recommendations emphasize hybrid evaluation approaches combining AI automation with instructor supervision, ethical guidelines, and integration of AI tools into learning management systems. The findings indicate that AI-assisted grading can improve efficiency and pedagogical outcomes while maintaining academic integrity. 
653 |a Higher education 
653 |a Accuracy 
653 |a Students 
653 |a Essays 
653 |a Automation 
653 |a Artificial intelligence 
653 |a Feedback 
653 |a Peers 
653 |a Large language models 
653 |a Learning 
653 |a Teachers 
653 |a Chatbots 
700 1 |a Radošević Danijel  |u Faculty of Organization and Informatics, University of Zagreb, 42000 Varaždin, Croatia; darados@foi.hr 
700 1 |a Čep Andrej  |u Inpro d.o.o., Department for Systems Implementation, 40000 Čakovec, Croatia; andrej.cep0@gmail.com 
773 0 |t Applied Sciences  |g vol. 15, no. 18 (2025), p. 10055-10067 
786 0 |d ProQuest  |t Publicly Available Content Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3254469051/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text  |u https://www.proquest.com/docview/3254469051/fulltext/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3254469051/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
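
The grading setup described in the abstract (field 520), structured prompts aligned with a four-criterion rubric and agreement between AI-generated and instructor scores measured by Pearson correlation and mean absolute deviation, can be illustrated with a short Python sketch. The rubric weights, prompt wording, function names, and sample scores below are illustrative assumptions, not the authors' actual materials or data.

# Minimal sketch (not the authors' actual pipeline): a rubric-aligned grading
# prompt and the two agreement statistics named in the abstract (Pearson
# correlation, mean absolute deviation). Rubric weights, the 0-10 score scale,
# and the sample data are illustrative assumptions.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical rubric mirroring the four criteria listed in the abstract.
RUBRIC = {
    "functionality": 4,    # assumed maximum points per criterion
    "code_structure": 3,
    "documentation": 2,
    "efficiency": 1,
}

def build_grading_prompt(assignment_text: str, student_code: str) -> str:
    """Assemble a structured prompt asking an LLM to score one submission."""
    criteria = "\n".join(f"- {name} (0-{pts} points)" for name, pts in RUBRIC.items())
    return (
        "You are grading an undergraduate Python assignment.\n"
        f"Task description:\n{assignment_text}\n\n"
        f"Student submission:\n{student_code}\n\n"
        "Score each criterion and briefly justify each score:\n"
        f"{criteria}\n"
        "Return the scores as JSON with a one-sentence comment per criterion."
    )

def agreement_stats(ai_scores, instructor_scores):
    """Pearson correlation and mean absolute deviation between two score lists."""
    ai = np.asarray(ai_scores, dtype=float)
    human = np.asarray(instructor_scores, dtype=float)
    r, _p = pearsonr(ai, human)
    mad = np.mean(np.abs(ai - human))
    return r, mad

if __name__ == "__main__":
    # Made-up total scores (0-10 scale) for a handful of submissions.
    ai_totals = [8.5, 6.0, 9.0, 7.5, 5.0]
    instructor_totals = [8.0, 6.5, 9.5, 7.0, 5.5]
    r, mad = agreement_stats(ai_totals, instructor_totals)
    print(f"Pearson r = {r:.2f}, mean absolute deviation = {mad:.2f} points")

Running the sketch prints the two agreement statistics for the made-up score lists; in the study itself these were computed against instructor grades on 100 submissions, with ChatGPT-4 reported at a Pearson coefficient of 0.91 and a mean absolute deviation of 0.68 points.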