LLM-based automatic short answer grading in undergraduate medical education
| Published in: | BMC Medical Education vol. 24 (2024), p. 1 |
|---|---|
| Main author: | Grévisse, Christian |
| Publication: | Springer Nature B.V. |
| Online access: | Citation/Abstract · Full Text · Full Text - PDF |
MARC
| LEADER | 00000nab a2200000uu 4500 | | |
|---|---|---|---|
| 001 | | | 3115122456 |
| 003 | | | UK-CbPIL |
| 022 | | | |a 1472-6920 |
| 024 | 7 | | |a 10.1186/s12909-024-06026-5 |2 doi |
| 035 | | | |a 3115122456 |
| 045 | 2 | | |b d20240101 |b d20241231 |
| 084 | | | |a 58506 |2 nlm |
| 100 | 1 | | |a Grévisse, Christian |
| 245 | 1 | | |a LLM-based automatic short answer grading in undergraduate medical education |
| 260 | | | |b Springer Nature B.V. |c 2024 |
| 513 | | | |a Journal Article |
| 520 | 3 | | |a Background: Multiple choice questions are heavily used in medical education assessments, but they rely on recognition instead of knowledge recall. However, grading open questions is a time-intensive task for teachers. Automatic short answer grading (ASAG) has tried to fill this gap, and with the recent advent of Large Language Models (LLMs), this field has gained new momentum. Methods: We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro. Results: GPT-4 proposed significantly lower grades than the human evaluator, but reached low rates of false positives. The grades of Gemini 1.0 Pro were not significantly different from the teachers'. Both LLMs reached a moderate agreement with human grades, and GPT-4 achieved a high precision among answers considered fully correct. A consistent grading behavior could be determined for high-quality answer keys. A weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori. Conclusions: LLM-based ASAG applied to medical education still requires human oversight, but time can be spared on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs seems to be sufficient; fine-tuning is thus not necessary. |
| 610 | 4 | | |a University of Luxembourg |
| 653 | | | |a Language |
| 653 | | | |a Students |
| 653 | | | |a Medical education |
| 653 | | | |a Knowledge |
| 653 | | | |a Hallucinations |
| 653 | | | |a Python |
| 653 | | | |a Large language models |
| 653 | | | |a Learning |
| 653 | | | |a Natural language |
| 653 | | | |a Multiple choice |
| 653 | | | |a Computer Oriented Programs |
| 653 | | | |a Feedback (Response) |
| 653 | | | |a Climate |
| 653 | | | |a Natural Language Processing |
| 653 | | | |a Student Records |
| 653 | | | |a Educational Assessment |
| 653 | | | |a Formative Evaluation |
| 653 | | | |a Evaluators |
| 653 | | | |a Summative Evaluation |
| 653 | | | |a Language Processing |
| 653 | | | |a Educational Technology |
| 653 | | | |a Concept Mapping |
| 653 | | | |a Grading |
| 653 | | | |a Efficiency |
| 773 | 0 | | |t BMC Medical Education |g vol. 24 (2024), p. 1 |
| 786 | 0 | | |d ProQuest |t Healthcare Administration Database |
| 856 | 4 | 1 | |3 Citation/Abstract |u https://www.proquest.com/docview/3115122456/abstract/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text |u https://www.proquest.com/docview/3115122456/fulltext/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
| 856 | 4 | 0 | |3 Full Text - PDF |u https://www.proquest.com/docview/3115122456/fulltextPDF/embedded/H09TXR3UUZB2ISDL?source=fedsrch |
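
For readers who want to experiment with the approach summarized in field 520 above (grading a free-text student answer against a teacher-provided key with an LLM), the sketch below shows what a minimal ASAG call could look like in Python. It is illustrative only: the prompt wording, the 0-1 grading scale, the model name, the example item, and the `grade_answer` helper are assumptions, not the authors' actual pipeline, and it assumes the `openai` client library with an API key in the environment.

```python
# Minimal illustrative sketch of LLM-based short answer grading (ASAG).
# NOT the pipeline from the paper; prompt, scale, and model are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_answer(question: str, key: str, student_answer: str) -> str:
    """Ask the model for a grade between 0 and 1 for a single student answer."""
    prompt = (
        "You are grading a short answer in an undergraduate medical exam.\n"
        f"Question: {question}\n"
        f"Answer key: {key}\n"
        f"Student answer: {student_answer}\n"
        "Reply with a single number between 0 and 1 (the grade), nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper also evaluates Gemini 1.0 Pro
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as deterministic as the API allows
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    # Hypothetical example item, not taken from the study's dataset.
    print(grade_answer(
        "Name the main neurotransmitter at the neuromuscular junction.",
        "Acetylcholine",
        "I think it is acetylcholine.",
    ))
```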