JAMA Network Open
Randomized Controlled Trial
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.
Why care about LLMs?
Large language models (LLMs) have revolutionised natural language processing, so it was inevitable that they would find their way into healthcare. Their use in decision support and diagnosis has, however, shown mixed results, even as models and integrations improve rapidly.
Despite these shortcomings, LLMs cannot be ignored by doctors: growing cost and demand pressures in healthcare will continue to push LLM-based tools into clinical practice, even before robust clinical validation. We also know that diagnostic errors are common and costly, in both economic and patient-safety terms, which only adds to the allure of medical LLMs.
What did this study do?
This single-blinded randomised controlled trial included 50 physicians (26 attendings, 24 residents) from family medicine, internal medicine, and emergency medicine. Participants were randomised to use either GPT-4 (via ChatGPT Plus) alongside conventional resources or conventional resources alone, completing up to six clinical diagnostic cases within 60 minutes.
Diagnostic performance was measured using a validated, standardised score covering three elements: the accuracy of the generated differential diagnosis, the ability to identify supporting and contradicting clinical findings, and the appropriateness of the proposed next diagnostic steps.
(Interesting aside: the six vignettes were drawn from a pool of 105 real, never-published patient cases assembled in 1994 for a landmark study of diagnostic systems. Because the cases have been kept private to preserve their validity for future testing, they are guaranteed to fall outside the LLM's training data.)
And they found?
The LLM alone performed significantly better than either physician group, scoring 16 percentage points higher than the control group (95% CI, 2-30 percentage points). Yet physicians with access to the LLM showed no meaningful improvement over the conventional-resources-only group (median diagnostic score 76% vs 74%; P = .60). Time spent per case did not differ between groups.
"Access alone to LLMs will not improve overall physician diagnostic reasoning in practice. These findings are particularly relevant now that many health systems offer [HIPAA]–compliant chatbots ... often with no to minimal training..."
Bottom line
This study highlights the "implementation gap" between AI capability and clinical utility: even if reliably and consistently accurate (a big 'if'), the mere availability of AI tools will not automatically translate into improved clinical reasoning. Successful integration will require deliberate consideration of how to optimise human-AI collaboration in medical practice.