Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam

Bibliographic Details
Published in: BMC Medical Education vol. 25 (2025), p. 1-15
Main Author: Altermatt, Fernando R
Other Authors: Neyem, Andres; Sumonte, Nicolás I; Villagrán, Ignacio; Mendoza, Marcelo; Lacassie, Hector J; Delfino, Alejandro E
Published: Springer Nature B.V.
Online Access: Citation/Abstract
Full Text
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3268438433
003 UK-CbPIL
022 |a 1472-6920 
024 7 |a 10.1186/s12909-025-08084-9  |2 doi 
035 |a 3268438433 
045 2 |b d20250101  |b d20251231 
084 |a 58506  |2 nlm 
100 1 |a Altermatt, Fernando R 
245 1 |a Evaluating GPT-4o in high-stakes medical assessments: performance and error analysis on a Chilean anesthesiology exam 
260 |b Springer Nature B.V.  |c 2025 
513 |a Journal Article 
520 3 |a Background: Large language models (LLMs) such as GPT-4o have the potential to transform clinical decision-making, patient education, and medical research. Despite impressive performance in generating patient-friendly educational materials and assisting in clinical documentation, concerns remain regarding the reliability, subtle errors, and biases that can undermine their use in high-stakes medical settings. Methods: A multi-phase experimental design was employed to assess the performance of GPT-4o on the Chilean anesthesiology exam (CONACEM), which comprised 183 questions covering four cognitive domains (Understanding, Recall, Application, and Analysis) based on Bloom's taxonomy. Thirty independent simulation runs were conducted with systematic variation of the model's temperature parameter to gauge the balance between deterministic and creative responses. The generated responses underwent qualitative error analysis using a refined taxonomy that categorized errors such as "Unsupported Medical Claim," "Hallucination of Information," "Sticking with Wrong Diagnosis," "Non-medical Factual Error," "Incorrect Understanding of Task," "Reasonable Response," "Ignore Missing Information," and "Incorrect or Vague Conclusion." Two board-certified anesthesiologists performed independent annotations, with disagreements resolved by a third expert. Statistical evaluations, including one-way ANOVA, non-parametric tests, chi-square tests, and linear mixed-effects modeling, were used to compare performance across domains and analyze error frequency. Results: GPT-4o achieved an overall accuracy of 83.69%. Performance varied significantly by cognitive domain, with the highest accuracy observed in the Understanding (90.10%) and Recall (84.38%) domains, and lower accuracy in Application (76.83%) and Analysis (76.54%). Among the 120 incorrect responses, unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%). Co-occurrence analyses revealed that unsupported claims often appeared alongside imprecise conclusions, highlighting a pattern of compounded errors, particularly in tasks requiring complex reasoning. Inter-rater reliability for error annotation was robust, with a mean Cohen's kappa of 0.73. Conclusions: While GPT-4o exhibits strengths in factual recall and comprehension, its limitations in handling higher-order reasoning and diagnostic judgment are evident in frequent unsupported medical claims and vague conclusions. These findings underscore the need for improved domain-specific fine-tuning, enhanced error-mitigation strategies, and integrated knowledge-verification mechanisms prior to clinical deployment. 
653 |a Adaptation 
653 |a Anesthesia 
653 |a Performance evaluation 
653 |a Cognition & reasoning 
653 |a Business metrics 
653 |a Statistical analysis 
653 |a Monte Carlo simulation 
653 |a Anesthesiology 
653 |a Human performance 
653 |a Temperature 
653 |a Decision making 
653 |a Taxonomy 
653 |a Creativity 
653 |a Determinism 
653 |a Error analysis 
653 |a Annotations 
653 |a Large language models 
653 |a Guidelines 
653 |a Recall (Psychology) 
653 |a Standardized Tests 
653 |a High Stakes Tests 
653 |a Sample Size 
653 |a Interrater Reliability 
653 |a Medical Education 
653 |a Measurement Techniques 
653 |a Patient Education 
653 |a Program Evaluation 
653 |a Error Patterns 
653 |a Medical Evaluation 
653 |a Nonparametric Statistics 
653 |a Simulation 
653 |a Feedback (Response) 
653 |a Climate 
653 |a Accuracy 
653 |a Effect Size 
653 |a Statistical Data 
653 |a Hypothesis Testing 
653 |a Definitions 
700 1 |a Neyem, Andres 
700 1 |a Sumonte, Nicolás I 
700 1 |a Villagrán, Ignacio 
700 1 |a Mendoza, Marcelo 
700 1 |a Lacassie, Hector J 
700 1 |a Delfino, Alejandro E 
773 0 |t BMC Medical Education  |g vol. 25 (2025), p. 1-15 
786 0 |d ProQuest  |t Healthcare Administration Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3268438433/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text  |u https://www.proquest.com/docview/3268438433/fulltext/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3268438433/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch
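
Illustrative note: the Methods summary in field 520 above describes repeated GPT-4o simulation runs scored by cognitive domain and an error-annotation step whose agreement was measured with Cohen's kappa. The Python sketch below shows one minimal way per-domain accuracy and unweighted kappa could be computed; the data layout, function names, and toy values are assumptions for illustration only, not the authors' actual pipeline.

"""Sketch (not the authors' code): aggregate repeated exam runs into
per-domain accuracy and compute Cohen's kappa for two error annotators."""

from collections import defaultdict

def per_domain_accuracy(runs):
    """runs: iterable of dicts like {"domain": "Recall", "correct": True},
    one entry per question per simulation run (assumed layout)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in runs:
        totals[r["domain"]] += 1
        hits[r["domain"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Toy data standing in for 30 runs over the 183 exam questions.
    runs = [
        {"domain": "Recall", "correct": True},
        {"domain": "Recall", "correct": False},
        {"domain": "Analysis", "correct": True},
        {"domain": "Analysis", "correct": False},
    ]
    print(per_domain_accuracy(runs))

    rater1 = ["Unsupported Medical Claim", "Hallucination of Information", "Reasonable Response"]
    rater2 = ["Unsupported Medical Claim", "Incorrect or Vague Conclusion", "Reasonable Response"]
    print(round(cohens_kappa(rater1, rater2), 2))

Running the toy example prints a per-domain accuracy dictionary and a kappa value; real use would substitute the full set of graded responses and the two annotators' error-taxonomy labels described in the abstract.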