LLM-based automatic short answer grading in undergraduate medical education

Bibliographic Details
Published in: BMC Medical Education vol. 24 (2024), p. 1
Main Author: Grévisse, Christian
Publisher: Springer Nature B.V.
Subjects:
Online Access: Citation/Abstract
Full Text
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3115122456
003 UK-CbPIL
022 |a 1472-6920 
024 7 |a 10.1186/s12909-024-06026-5  |2 doi 
035 |a 3115122456 
045 2 |b d20240101  |b d20241231 
084 |a 58506  |2 nlm 
100 1 |a Grévisse, Christian 
245 1 |a LLM-based automatic short answer grading in undergraduate medical education 
260 |b Springer Nature B.V.  |c 2024 
513 |a Journal Article 
520 3 |a Background: Multiple choice questions are heavily used in medical education assessments, but they rely on recognition instead of knowledge recall. Grading open questions, however, is a time-intensive task for teachers. Automatic short answer grading (ASAG) has tried to fill this gap, and with the recent advent of Large Language Models (LLMs), this branch has seen new momentum. Methods: We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro. Results: GPT-4 proposed significantly lower grades than the human evaluator, but reached low rates of false positives. The grades of Gemini 1.0 Pro were not significantly different from the teachers'. Both LLMs reached a moderate agreement with human grades, and GPT-4 reached a high precision among answers considered fully correct. A consistent grading behavior could be determined for high-quality answer keys. A weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori. Conclusions: LLM-based ASAG applied to medical education still requires human oversight, but time can be spared on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs seems to be sufficient; fine-tuning is thus not necessary. 
610 4 |a University of Luxembourg 
653 |a Language 
653 |a Students 
653 |a Medical education 
653 |a Knowledge 
653 |a Hallucinations 
653 |a Python 
653 |a Large language models 
653 |a Learning 
653 |a Natural language 
653 |a Multiple choice 
653 |a Computer Oriented Programs 
653 |a Feedback (Response) 
653 |a Climate 
653 |a Natural Language Processing 
653 |a Student Records 
653 |a Educational Assessment 
653 |a Formative Evaluation 
653 |a Evaluators 
653 |a Summative Evaluation 
653 |a Language Processing 
653 |a Educational Technology 
653 |a Concept Mapping 
653 |a Grading 
653 |a Efficiency 
773 0 |t BMC Medical Education  |g vol. 24 (2024), p. 1 
786 0 |d ProQuest  |t Healthcare Administration Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3115122456/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text  |u https://www.proquest.com/docview/3115122456/fulltext/embedded/H09TXR3UUZB2ISDL?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3115122456/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch
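
The abstract above describes grading free-text student answers by prompting GPT-4 and Gemini 1.0 Pro. The snippet below is a minimal, illustrative sketch of such an ASAG call using the OpenAI Python SDK; the prompt wording, the 0-2 point scale, the model choice, and the output format are assumptions made for illustration and do not reproduce the study's actual rubric or setup.

```python
# Minimal sketch of LLM-based automatic short answer grading (ASAG),
# assuming the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
# Prompt wording, grading scale, and model parameters are illustrative
# assumptions, not the rubric used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_short_answer(question: str, answer_key: str, student_answer: str) -> str:
    """Ask the model for a grade and a one-sentence justification."""
    prompt = (
        "You are grading a short answer from an undergraduate medical exam.\n"
        f"Question: {question}\n"
        f"Answer key: {answer_key}\n"
        f"Student answer: {student_answer}\n"
        "Assign a grade between 0 and 2 points (0 = wrong, 1 = partially "
        "correct, 2 = fully correct) and justify it in one sentence. "
        "Respond as: GRADE: <number> | REASON: <one sentence>."
    )
    response = client.chat.completions.create(
        model="gpt-4",   # model choice is an assumption
        temperature=0,   # favor consistent grading behavior
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(grade_short_answer(
        question="Name the valve between the left atrium and the left ventricle.",
        answer_key="The mitral (bicuspid) valve.",
        student_answer="The mitral valve.",
    ))
```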