• Prehosp Emerg Care · Jul 2024

    The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls.

    • Graham Brant-Zawadzki, Brent Klapthor, Chris Ryba, Drew C Youngquist, Brooke Burton, Helen Palatinus, and Scott T Youngquist.
    • Department of Emergency Medicine, University of Utah, Salt Lake City, Utah.
    • Prehosp Emerg Care. 2024 Jul 22: 181-8.

    ObjectivesThis study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs), for Emergency Medical Services (EMS) quality assurance. The implementation of these LLMs for EMS quality assurance has the potential to significantly reduce the workload on medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This offers the potential for more efficient and accurate identification of areas requiring improvement, thereby potentially enhancing patient care outcomes.MethodsTwo expert human reviewers, ChatGPT GPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled and anonymized prehospital records from 2 large urban EMS agencies for adherence to 2020 National Association of State EMS metrics for cardiac care. We evaluated the accuracy of scoring, inter-rater reliability, and review efficiency. The inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.ResultsHuman reviewers showed high interrater reliability, with 91.2% agreement and a kappa coefficient 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers in EKG documentation and aspirin administration (76.2% agreement, kappa coefficient 0.401 (0.334-0.468), but performance varied across other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 01:28 min (IQR 1:12 - 1:51 min) per human chart review, 01:24 min (IQR 01:09 - 01:53 min) per ChatGPT-4 chart review (p = 0.46), and 01:50 min (IQR 01:10-03:34 min) per Gemini Ultra review (p = 0.06).ConclusionsLarge language models demonstrate potential in supporting quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to human evaluators. Our findings suggest that current LLMs may best offer supplemental support to the human review processes, but their current value remains limited. Enhancements in LLM training and integration are recommended for improved and more reliable performance in the quality assurance processes.

      Pubmed     Copy Citation     Plaintext  

      Add institutional full text...

    Notes

     
    Knowledge, pearl, summary or comment to share?
    300 characters remaining
    help        
    You can also include formatting, links, images and footnotes in your notes
    • Simple formatting can be added to notes, such as *italics*, _underline_ or **bold**.
    • Superscript can be denoted by <sup>text</sup> and subscript <sub>text</sub>.
    • Numbered or bulleted lists can be created using either numbered lines 1. 2. 3., hyphens - or asterisks *.
    • Links can be included with: [my link to pubmed](http://pubmed.com)
    • Images can be included with: ![alt text](https://bestmedicaljournal.com/study_graph.jpg "Image Title Text")
    • For footnotes use [^1](This is a footnote.) inline.
    • Or use an inline reference [^1] to refer to a longer footnote elseweher in the document [^1]: This is a long footnote..

    hide…

Want more great medical articles?

Keep up to date with a free trial of metajournal, personalized for your practice.
1,694,794 articles already indexed!

We guarantee your privacy. Your email address will not be shared.