Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback

Saved in:
Bibliographic Details
Published in: Journal of Medical Internet Research vol. 27 (2025), p. e68486
Main Author: Cook, David A
Other Authors: Overgaard, Joshua; Pankratz, V Shane; Del Fiol, Guilherme; Aakre, Chris A
Published: Gunther Eysenbach MD MPH, Associate Professor
Online Access: Citation/Abstract
Full Text + Graphics
Full Text - PDF

MARC

LEADER 00000nab a2200000uu 4500
001 3222369268
003 UK-CbPIL
022 |a 1438-8871 
024 7 |a 10.2196/68486  |2 doi 
035 |a 3222369268 
045 2 |b d20250101  |b d20251231 
100 1 |a Cook, David A 
245 1 |a Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback 
260 |b Gunther Eysenbach MD MPH, Associate Professor  |c 2025 
513 |a Journal Article 
520 3 |a Background: Virtual patients (VPs) are computer screen–based simulations of patient-clinician encounters. VP use is limited by cost and low scalability. Objective: We aimed to show that VPs powered by large language models (LLMs) can generate authentic dialogues, accurately represent patient preferences, and provide personalized feedback on clinical performance. We also explored using LLMs to rate the quality of dialogues and feedback. Methods: We conducted an intrinsic evaluation study rating 60 VP-clinician conversations. We used carefully engineered prompts to direct OpenAI’s generative pretrained transformer (GPT) to emulate a patient and provide feedback. Using 2 outpatient medicine topics (chronic cough diagnosis and diabetes management), each with permutations representing different patient preferences, we created 60 conversations (dialogues plus feedback): 48 with a human clinician and 12 “self-chat” dialogues with GPT role-playing both the VP and clinician. Primary outcomes were dialogue authenticity and feedback quality, rated using novel instruments for which we conducted a validation study collecting evidence of content, internal structure (reproducibility), relations with other variables, and response process. Each conversation was rated by 3 physicians and by GPT. Secondary outcomes included user experience, bias, patient preferences represented in the dialogues, and conversation features that influenced authenticity. Results: The average cost per conversation was US $0.51 for GPT-4.0-Turbo and US $0.02 for GPT-3.5-Turbo. Mean (SD) conversation ratings, maximum 6, were overall dialogue authenticity 4.7 (0.7), overall user experience 4.9 (0.7), and average feedback quality 4.7 (0.6). For dialogues created using GPT-4.0-Turbo, physician ratings of patient preferences aligned with intended preferences in 20 to 47 of 48 dialogues (42%-98%). Subgroup comparisons revealed higher ratings for dialogues using GPT-4.0-Turbo versus GPT-3.5-Turbo and for human-generated versus self-chat dialogues. GPT-generated ratings of feedback quality were similar to human ratings, whereas GPT-generated authenticity ratings were lower. We did not perceive bias in any conversation. Dialogue features that detracted from authenticity included that GPT was verbose or used atypical vocabulary (93/180, 51.7% of conversations), was overly agreeable (n=56, 31%), repeated the question as part of the response (n=47, 26%), was easily convinced by clinician suggestions (n=35, 19%), or was not disaffected by poor clinician performance (n=32, 18%). For feedback, detractors included excessively positive feedback (n=42, 23%), failure to mention important weaknesses or strengths (n=41, 23%), or factual inaccuracies (n=39, 22%). Regarding validation of dialogue and feedback scores, items were meticulously developed (content evidence), and we confirmed expected relations with other variables (higher ratings for advanced LLMs and human-generated dialogues). Reproducibility was suboptimal, due largely to variation in LLM performance rather than rater idiosyncrasies. Conclusions: LLM-powered VPs can simulate patient-clinician dialogues, demonstrably represent patient preferences, and provide personalized performance feedback. This approach is scalable, globally accessible, and inexpensive. LLM-generated ratings of feedback quality are similar to human ratings. (An illustrative prompt sketch follows this record.)
610 4 |a OpenAI 
651 4 |a United States--US 
653 |a Feedback 
653 |a Diabetes 
653 |a Ratings & rankings 
653 |a Management decisions 
653 |a Application programming interface 
653 |a Medical diagnosis 
653 |a Verbal communication 
653 |a Cough reflex 
653 |a Internal medicine 
653 |a Disease management 
653 |a Physicians 
653 |a Simulated clients 
653 |a Medical students 
653 |a Authenticity 
653 |a Internet 
653 |a Reproducibility 
653 |a Vocabulary 
653 |a Permutations 
653 |a Role playing 
653 |a Patients 
653 |a Humans 
653 |a Artificial intelligence 
653 |a Standardized patients 
653 |a Simulation 
653 |a Multimedia 
653 |a Validation studies 
653 |a Bias 
653 |a Preferences 
653 |a Large language models 
653 |a Treatment preferences 
653 |a Customization 
653 |a Models 
653 |a Predicate 
653 |a Variables 
653 |a Human-computer interaction 
653 |a Validity 
653 |a Medical personnel 
653 |a Conversation 
653 |a Chat 
653 |a Medicine 
653 |a Dialogue 
653 |a Language modeling 
700 1 |a Overgaard, Joshua 
700 1 |a Pankratz, V Shane 
700 1 |a Del Fiol, Guilherme 
700 1 |a Aakre, Chris A 
773 0 |t Journal of Medical Internet Research  |g vol. 27 (2025), p. e68486 
786 0 |d ProQuest  |t Library Science Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3222369268/abstract/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text + Graphics  |u https://www.proquest.com/docview/3222369268/fulltextwithgraphics/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch 
856 4 0 |3 Full Text - PDF  |u https://www.proquest.com/docview/3222369268/fulltextPDF/embedded/7BTGNMKEMPT1V9Z2?source=fedsrch
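
Illustrative sketch: The abstract describes directing OpenAI's GPT, via engineered prompts, to role-play a patient and then critique the clinician's performance. Below is a minimal Python sketch of such a virtual-patient turn loop using the OpenAI chat completions API; the persona prompt, case vignette, and model choice are illustrative assumptions, not the study's actual engineered prompts.

    # Minimal virtual-patient sketch, assuming the OpenAI Python SDK (v1+).
    # The persona prompt and case details below are invented for illustration;
    # the study's actual engineered prompts are not reproduced here.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PATIENT_PROMPT = (
        "Role-play a 52-year-old patient with an 8-week cough. Answer only "
        "what the clinician asks, in one or two short sentences. You prefer "
        "to avoid new medications and you are worried about cancer."
    )

    # Running transcript: system persona first, then alternating turns.
    messages = [{"role": "system", "content": PATIENT_PROMPT}]

    def patient_reply(clinician_utterance: str) -> str:
        """Append the clinician's turn and return the virtual patient's turn."""
        messages.append({"role": "user", "content": clinician_utterance})
        resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
        reply = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        return reply

    print(patient_reply("What brings you in today?"))

    # After the encounter, a second call can pass the accumulated transcript
    # with a feedback prompt asking the model to critique the clinician.

The same pattern plausibly supports the "self-chat" dialogues the abstract mentions: a second model instance, prompted to act as the clinician, would supply the clinician turns instead of a human.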