- Rohaid Ali, Oliver Y Tang, Ian D Connolly, Jared S Fridley, John H Shin, Patricia L Zadnik Sullivan, Deus Cielo, Adetokunbo A Oyelese, Curtis E Doberstein, Albert E Telfeian, Ziya L Gokaslan, and Wael F Asaad.
- Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA.
- Neurosurgery. 2023 Nov 1;93(5):1090-1098.
Background and Objectives: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs, and LLM performance on assessments composed predominantly of higher-order management questions, are poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.

Methods: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were inputted in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.

Results: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of a question text description predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).

Conclusion: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT (GPT-3.5) and Google Bard.

Copyright © Congress of Neurological Surgeons 2023. All rights reserved.
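As a rough illustration of the accuracy comparisons summarized above, the sketch below recomputes the headline proportions and runs unpaired 2×2 tests on the reported counts. Only Bard's count (66/149) is stated explicitly; the GPT-3.5 and GPT-4 counts (93/149 and 123/149) are back-calculated from the reported percentages and are an assumption. The abstract does not specify the CI method or whether pairwise comparisons were paired across the same 149 questions, so this sketch will not necessarily reproduce the paper's exact CIs or P values.

```python
# Sketch: recomputing headline figures from the abstract's reported counts.
# Assumed counts: 93/149 (GPT-3.5, back-calculated from 62.4%),
# 123/149 (GPT-4, back-calculated from 82.6%), 66/149 (Bard, as reported).
# Uses a Wilson score interval and unpaired chi-square / Fisher exact tests;
# the paper's actual procedures may differ.
from math import sqrt
from scipy.stats import chi2_contingency, fisher_exact

N = 149
correct = {"GPT-3.5": 93, "GPT-4": 123, "Bard": 66}

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    centre = p + z * z / (2 * n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - half) / denom, (centre + half) / denom

for model, k in correct.items():
    lo, hi = wilson_ci(k, N)
    print(f"{model}: {k}/{N} = {k/N:.1%} (95% CI {lo:.1%}-{hi:.1%})")

# Pairwise 2x2 comparisons of correct vs. incorrect counts.
for a, b in [("GPT-4", "GPT-3.5"), ("GPT-4", "Bard"), ("GPT-3.5", "Bard")]:
    table = [[correct[a], N - correct[a]],
             [correct[b], N - correct[b]]]
    chi2, p_chi2, _, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)
    print(f"{a} vs {b}: chi-square P = {p_chi2:.3f}, Fisher exact P = {p_fisher:.3f}")
```

The Wilson interval for 123/149 lands close to, but not exactly on, the reported 75.2%-88.1%, which suggests the authors may have used an exact (Clopper-Pearson) or other interval; the choice here is illustrative only.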