Randomized Controlled Trial
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.
Why care about LLMs?
Large language models (LLMs) have revolutionised natural language processing, and so have inevitably found their way into healthcare. Their use in decision support and diagnosis has, however, shown mixed results, even as models and integrations improve quickly.
Despite their shortcomings, LLMs cannot be ignored by doctors: rising healthcare costs and demand will continue to push LLM-based tools into clinical practice, even before robust clinical validation. We also know that diagnostic errors are common and costly, in both economic and patient-safety terms, which increases the allure of medical LLMs.
What did this study do?
This single-blinded randomised controlled trial included 50 physicians (26 attendings, 24 residents) from family medicine, internal medicine, and emergency medicine. Participants were randomised to use either ChatGPT-4 plus conventional resources or conventional resources only, and were given 60 minutes to complete up to six clinical diagnostic cases.
Diagnostic performance was measured using validated standardised scoring of three elements: accuracy of generated differential diagnoses, ability to identify supporting and contradicting clinical findings, and the appropriateness of proposed next diagnostic steps.
(Interesting aside: the six selected vignettes were from a 1994 pool of 105 never-published real patient cases originally used in a landmark study on diagnostic systems, guaranteed to be outside the LLM's training data, as these cases have been kept private to preserve their future testing validity.)
And they found?
The LLM alone scored higher than both physician groups, 16 percentage points higher than the control group (95% CI, 2-30 percentage points; p=.03). Yet physicians with access to the LLM showed no significant improvement compared to the conventional-resources-alone group (76% vs 74% median diagnostic score, p=.60). Time spent per case did not differ significantly between groups.
"Access alone to LLMs will not improve overall physician diagnostic reasoning in practice. These findings are particularly relevant now that many health systems offer [HIPAA]–compliant chatbots ... often with no to minimal training..."
Bottom line
This study highlights the "implementation gap" between AI capability and clinical utility: even if AI tools are reliably and consistently accurate (a big 'if'), their mere availability will not automatically translate into improved clinical reasoning. Successful integration will require deliberate consideration of how to optimise human-AI collaboration in medical practice.
- Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Yingjie Weng, Hannah Kerman, Joséphine A Cool, Zahir Kanjee, Andrew S Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew P J Olson (Department of Hospital Medicine, University of Minnesota Medical School, Minneapolis), Adam Rodman, and Jonathan H Chen.
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California.
- JAMA Netw Open. 2024 Oct 1; 7 (10): e2440969.
Importance: Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physician diagnostic reasoning.
Objective: To assess the effect of an LLM on physicians' diagnostic reasoning compared with conventional resources.
Design, Setting, and Participants: A single-blind randomized clinical trial was conducted from November 29 to December 29, 2023. Using remote video conferencing and in-person participation across multiple academic medical institutions, physicians with training in family medicine, internal medicine, or emergency medicine were recruited.
Intervention: Participants were randomized to either access the LLM in addition to conventional diagnostic resources or conventional resources only, stratified by career stage. Participants were allocated 60 minutes to review up to 6 clinical vignettes.
Main Outcomes and Measures: The primary outcome was performance on a standardized rubric of diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, validated and graded via blinded expert consensus. Secondary outcomes included time spent per case (in seconds) and final diagnosis accuracy. All analyses followed the intention-to-treat principle. A secondary exploratory analysis evaluated the standalone performance of the LLM by comparing the primary outcomes between the LLM alone group and the conventional resource group.
Results: Fifty physicians (26 attendings, 24 residents; median years in practice, 3 [IQR, 2-8]) participated virtually as well as at 1 in-person site. The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, -4 to 8 percentage points; P = .60). The median time spent per case for the LLM group was 519 (IQR, 371-668) seconds, compared with 565 (IQR, 456-788) seconds for the conventional resources group, with a time difference of -82 (95% CI, -195 to 31; P = .20) seconds. The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group.
Conclusions and Relevance: In this trial, the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources. The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice.
Trial Registration: ClinicalTrials.gov Identifier: NCT06157944.
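For readers who want to sanity-check how the reported confidence intervals relate to the reported P values, here is a minimal Python sketch. It is not the authors' analysis: it assumes each interval is a symmetric normal (Wald-type) interval around the point estimate, and the helper name `approx_p_from_ci` is made up for illustration. Because the published figures are rounded and come from adjusted models, the back-calculated P values should only land in the same ballpark as those in the abstract.

```python
from statistics import NormalDist


def approx_p_from_ci(estimate, lower, upper, level=0.95):
    """Back out an approximate SE, z statistic and two-sided p-value.

    Assumption (not from the paper): the interval is a symmetric
    normal-approximation interval, so its half-width is z_crit * SE.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    se = (upper - lower) / (2 * z_crit)
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, z, p


# Point estimates and 95% CIs as reported in the abstract.
reported = [
    ("LLM alone vs conventional (score, points)", 16, 2, 30),
    ("Physicians + LLM vs conventional (score, points)", 2, -4, 8),
    ("Physicians + LLM vs conventional (time, seconds)", -82, -195, 31),
]

for label, est, lower, upper in reported:
    se, z, p = approx_p_from_ci(est, lower, upper)
    print(f"{label}: SE ~ {se:.1f}, z ~ {z:.2f}, approx p ~ {p:.2f}")
```

Under these assumptions, the LLM-alone comparison comes out below .05 and the two physician comparisons come out well above it, which matches the pattern of significance reported in the abstract even though the exact P values differ.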