Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis

Bibliographic Details
Published in: Journal of Medical Internet Research vol. 27 (2025), p. e64486
Main Author: Wang, Ling
Other Authors: Li, Jinglin, Zhuang, Boyang, Huang, Shasha, Fang, Meilin, Wang, Cunze, Li, Wen, Zhang, Mohan, Gong, Shurong
Published: Gunther Eysenbach MD MPH, Associate Professor
Subjects:
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3222368409
003 UK-CbPIL
022 |a 1438-8871 
024 7 |a 10.2196/64486  |2 doi 
035 |a 3222368409 
045 2 |b d20250101  |b d20251231 
100 1 |a Wang, Ling 
245 1 |a Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis 
260 |b Gunther Eysenbach MD MPH, Associate Professor  |c 2025 
513 |a Journal Article 
520 3 |a Background: Large language models (LLMs) have flourished and are gradually becoming an important research and application direction in the medical field. However, because medicine is highly specialized, complex, and specific, with extremely high accuracy requirements, controversy remains about whether LLMs can be used in the medical field. A growing number of studies have evaluated the performance of various LLMs in medicine, but their conclusions are inconsistent. Objective: This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions, providing high-level evidence for their future development and application in the medical field. Methods: In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were screened by reading the published reports and included. The systematic review and NMA compared the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian frequency theory methods, and indirect comparisons between programs were made using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM. Results: The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong accuracy on objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. For accuracy on the top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked highest, while Claude 3 Opus (SUCRA=0.9672) performed well on the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value for accuracy in triage and classification. Conclusions: Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 diagnosis and top 3 diagnosis. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous. This analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making in learning, diagnosis, and management of various clinical scenarios. Trial Registration: PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245
653 |a Language 
653 |a Accuracy 
653 |a Ratings & rankings 
653 |a Medical diagnosis 
653 |a Clinical research 
653 |a Evidence based research 
653 |a Specialization 
653 |a Chatbots 
653 |a Answers 
653 |a Natural language 
653 |a Medicine 
653 |a Systematic review 
653 |a Classification 
653 |a Meta-analysis 
653 |a Multimedia 
653 |a Decision making 
653 |a Medical research 
653 |a Triage 
653 |a Databases 
653 |a High risk 
653 |a Bayesian analysis 
653 |a Medical personnel 
653 |a Large language models 
653 |a Clinical assessment 
653 |a Objectives 
653 |a Risk 
653 |a Questions 
653 |a Human-computer interaction 
653 |a Registration 
653 |a Medical decision making 
653 |a Language modeling 
653 |a Research 
653 |a Research applications 
700 1 |a Li, Jinglin 
700 1 |a Zhuang, Boyang 
700 1 |a Huang, Shasha 
700 1 |a Fang, Meilin 
700 1 |a Wang, Cunze 
700 1 |a Li, Wen 
700 1 |a Zhang, Mohan 
700 1 |a Gong, Shurong 
773 0 |t Journal of Medical Internet Research  |g vol. 27 (2025), p. e64486 
786 0 |d ProQuest  |t Library Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3222368409/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3222368409/fulltextwithgraphics/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3222368409/fulltextPDF/embedded/L8HZQI7Z43R0LA5T?source=fedsrch
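Note on the SUCRA metric used throughout the abstract (520 field): SUCRA, the surface under the cumulative ranking curve, summarizes the rank probabilities that a Bayesian NMA assigns to each compared model. A minimal sketch of the standard definition from the NMA literature is given below; the symbols a (the number of compared models) and p_{ki} (the estimated probability that model k takes rank i) are illustrative assumptions, not values taken from this record.

\[
  \mathrm{cum}_{kj} = \sum_{i=1}^{j} p_{ki}, \qquad
  \mathrm{SUCRA}_k = \frac{1}{a-1} \sum_{j=1}^{a-1} \mathrm{cum}_{kj}
\]

SUCRA ranges from 0 (the model always ranks last) to 1 (it always ranks first), so the values reported in the abstract, such as SUCRA=0.9207 for ChatGPT-4o on objective questions, describe position in the accuracy ranking rather than a percentage of correct answers.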